CN107122382A - A kind of patent classification method based on specification - Google Patents
- Publication number
- CN107122382A (application number CN201710082677.8A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24147—Distances to closest patterns, e.g. nearest neighbour classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Abstract
The invention discloses a patent classification method based on the specification, belonging to the fields of text processing and data mining. First, the patent specification is preprocessed. Next, an inverted index file is built, and feature words are selected with a feature selection method that combines information gain and term frequency. An improved TF-IDF formula is then used to compute feature-word weights and build the patent feature vector. A neighbourhood set is then constructed for each training patent. Finally, the patent is classified with an optimized KNN classifier. This work offers a new approach to patent document classification and lays a foundation for further research such as intelligent patent retrieval.
Description
Technical field
The invention belongs to the field of computer-based analysis of patent documents, and in particular relates to a patent classification method that uses the patent specification.
Background technology
A patent is a concrete manifestation of technological innovation and enterprise value, and an important carrier, outcome, and source of knowledge development and innovation; many inventive achievements appear only in patent documents. According to statistics from the World Intellectual Property Organization (WIPO), 70%–90% of the world's inventions first appear in patent documents rather than in journals, papers, or other media. In addition, to protect their interests, enterprises apply for patents as early as possible, so patents concentrate the most active and advanced technology and contain 90%–95% of the world's technical information. At the same time, for the convenience of examination, patent documents are usually written in considerable detail; compared with other kinds of material, they provide more information and record the complete course of patenting activity. They reflect not only the current state of technical activity in each technical field but also the development history of particular technical areas. The patent documents of each application contain the specific technical solution of the invention, which is very important for enterprise innovation: enterprises can learn the latest research trends, avoid duplicated work, save time and funding, inspire their researchers, raise the starting point of innovation, draw on prior inventions, and greatly shorten research schedules.
With the continuing emergence of new research results and inventions in China, the number of patents has grown rapidly. As of October 5, 2016, more than 5.98 million Chinese invention patents had been published, of which 2.2385 million had been granted. If each patent averages 2 MB, the total volume of patent data reaches hundreds of TB. To manage these patent documents scientifically and to retrieve related patents quickly and conveniently, classification of patent documents is particularly important. At present, most countries classify patent documents using the International Patent Classification (IPC), which has five levels: section, class, subclass, main group, and subgroup. The section is the highest level; there are eight sections, each marked with a single letter from A to H. Each section contains several classes, denoted by two digits, and the number of classes under each section varies. For example, G06F21/00 denotes Physics – Electric digital data processing – Security arrangements for protecting computers, components, programs or data against unauthorised activity.
It follows that every published (or to-be-published) invention patent must be assigned one or more classification symbols; for instance, the invention patent "A method for protecting private data in association rule mining" is classified as G06F21/00. For a newly filed application, however, the classification symbol is unknown and must be determined. The current practice is to determine it from the technical field or content of the application, which requires manual reading by domain experts. As patent filings increase sharply (nearly one million applications per year), this approach consumes a great deal of manpower and resources, and the limits of any expert's knowledge make it hard to guarantee consistent and accurate results. The present invention therefore proposes a patent classification method based on the patent specification: it uses the information disclosed in published invention specifications to construct a classifier or classification function, determines the class of a new application with it, and thereby achieves automatic patent classification.
Summary of the invention
The object of the invention is to address the inability of existing patent classification methods to make effective use of the specifications of published invention patents. A patent classification method based on the patent specification is proposed: it makes full use of the specification text and the known classes of published patents to construct a classifier or classification function, which then determines the class of a newly filed application. Corresponding optimizations are proposed for feature extraction and selection from the specification and for the design of the classifier.
The technical solution adopted by the invention, a patent classification method based on the patent specification, mainly comprises the following steps:
(1) Preprocess the patent data
Collect patent sample data and their IPC symbols; extract the specifications; perform Chinese word segmentation and part-of-speech tagging. Remove symbols and numerals from the specification (specifications contain many paragraph labels). Filter out stop words, function words, conjunctions, and other words of little use for classification with regular-expression matching, retaining only content words such as nouns, adjectives, and verbs.
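The filtering described above can be sketched as follows. The stop-word list and the pre-tagged token pairs are illustrative stand-ins; a real pipeline would obtain them from a Chinese segmenter with part-of-speech tagging.

```python
import re

# Hypothetical stop words and POS-tagged tokens; real input would come from
# a Chinese segmenter's POS mode.
STOPWORDS = {"the", "a", "of"}
KEEP_POS = {"n", "a", "v"}          # noun, adjective, verb

def preprocess(tagged_tokens):
    """Drop symbols/digits, stop words, and words outside noun/adj/verb."""
    kept = []
    for word, pos in tagged_tokens:
        if re.fullmatch(r"[\W\d_]+", word):   # symbols and paragraph numbers
            continue
        if word in STOPWORDS or pos not in KEEP_POS:
            continue
        kept.append(word)
    return kept

tokens = [("classification", "n"), ("0001", "m"), ("the", "x"),
          ("improve", "v"), (";", "w"), ("accurate", "a")]
print(preprocess(tokens))            # ['classification', 'improve', 'accurate']
```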
(2) Build the inverted index file
Count each word's term frequency, position information, part-of-speech weight, and inter-class distribution, and build the inverted index file from these statistics and the patent text.
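A minimal sketch of the index structure described here, assuming a toy corpus of (patent id, class, tokens): each word maps to an event list of (patent id, term frequency, positions), and a per-class document count records the inter-class distribution. The position and part-of-speech weights of the full method are omitted for brevity.

```python
from collections import defaultdict

# Toy corpus; ids and classes are illustrative.
corpus = [
    ("P1", "G06F", ["index", "file", "index"]),
    ("P2", "A01B", ["file", "plough"]),
]

index = defaultdict(list)                         # word -> [(pid, tf, positions)]
class_df = defaultdict(lambda: defaultdict(int))  # word -> class -> doc count

for pid, cls, tokens in corpus:
    seen = {}
    for pos, w in enumerate(tokens):
        seen.setdefault(w, []).append(pos)
    for w, positions in seen.items():
        index[w].append((pid, len(positions), positions))
        class_df[w][cls] += 1

print(index["index"])           # [('P1', 2, [0, 2])]
print(dict(class_df["file"]))   # {'G06F': 1, 'A01B': 1}
```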
(3) Select patent text features
Compute the feature value of each word from step (2) with the feature selection method that combines information gain and term frequency, sort the words by feature value, and select a fixed number of feature words to represent the patent text.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j; B_ij the number that contain t_i but do not belong to c_j; C_ij the number that do not contain t_i but belong to c_j; and D_ij the number that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF reflects the influence of term frequency on feature selection. Let m be the number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC measures how scattered a feature word is across classes: the more scattered the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of t_i in class c_j, TF(t_i) its total frequency, and the overline of TF(t_i) its average frequency over all classes; IC is then computed as in formula (3).
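The information-gain part of the feature value can be sketched as below. For brevity the per-class sum of formula (1) is collapsed to a single class, and the TF and IC factors (formulas (2) and (3)) are passed in as precomputed values; the function names and these simplifications are illustrative, not the published implementation.

```python
import math

def info_term(x, y, N):
    """One braced term of formula (1): ((x+y)/N) * p*log(p), p = x/(x+y).
    x plays the role of A_ij (or C_ij), y of B_ij (or D_ij)."""
    if x == 0:
        return 0.0
    p = x / (x + y)
    return ((x + y) / N) * p * math.log(p)

def feature_value(A, B, C, D, N, TF=1.0, IC=1.0):
    """Sketch of G(t_i) = TF * IC * (term(A,B) + term(C,D))."""
    return TF * IC * (info_term(A, B, N) + info_term(C, D, N))

# 80 of 100 docs contain the word and are in-class; 5 contain it off-class.
v = feature_value(A=80, B=5, C=5, D=10, N=100)
print(round(v, 4))   # -0.1034 (entropy terms are negative)
```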
(4) Vectorize the patent text
This step includes:
1. Weight calculation, as in formula (4).
Here tf_i(t) is the frequency of feature word t in text d_i, N is the number of patents in the whole sample set, n is the number of patents in which t occurs, C_t is the part-of-speech weight coefficient of t, and P_t its position weight coefficient.
2. Sorting: sort by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent.
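Formula (4) itself is not reproduced in this text, so the sketch below assumes the classic tf·log(N/n) TF-IDF core scaled by the part-of-speech coefficient C_t and position coefficient P_t that the description defines; the exact published variant may differ.

```python
import math

def term_weight(tf, N, n, pos_coef, loc_coef):
    """Hedged sketch of formula (4): TF-IDF scaled by part-of-speech (C_t)
    and position (P_t) coefficients. N = total patents, n = patents
    containing the word; n+1 avoids division by zero."""
    return tf * math.log(N / (n + 1)) * pos_coef * loc_coef

# 1000 patents, word occurs in 50; a noun (2.5) seen in the technical field (1.0)
w = term_weight(tf=4, N=1000, n=50, pos_coef=2.5, loc_coef=1.0)
print(round(w, 3))   # 29.759
```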
(5) Generate the category feature vectors of each IPC level
This step includes:
1. Merge the class descriptions of all subgroups into the description of their main group, then segment and remove stop words.
2. After merging the descriptions of each main group, perform feature selection and construct the main-group-level category feature vectors of the IPC, written {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all descriptions under the same subclass, perform feature selection and construct the subclass-level category feature vectors, written {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all descriptions under the same class, perform feature selection and construct the class-level category feature vectors, written {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
(6) Build the patent sample neighbourhoods
This step includes:
1. Compute the similarity between every pair of patents in the training set, obtained as the cosine of the angle between their vectors. Let sim(d_i, d_j) denote the similarity of patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk are the weights of the corresponding feature word in the two patent vectors and n is the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order and select the first K samples to form the set D_i, called the neighbourhood of patent d_i; the value of K is chosen case by case.
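The cosine similarity of formula (5) and the neighbourhood selection can be sketched directly; the patent ids and weight vectors below are illustrative.

```python
import math

def cosine(u, v):
    """Formula (5): cosine of the angle between two weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def neighborhood(target, others, K):
    """D_i: the K training patents most similar to `target`."""
    ranked = sorted(others, key=lambda kv: cosine(target, kv[1]), reverse=True)
    return [pid for pid, _ in ranked[:K]]

train = {"P1": [1.0, 0.0, 2.0], "P2": [0.9, 0.1, 1.8], "P3": [0.0, 3.0, 0.0]}
print(neighborhood([1.0, 0.0, 2.0], list(train.items()), K=2))  # ['P1', 'P2']
```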
(7) Compute similarities for the patent to be classified
This step includes:
1. Extract the specification of the patent to be classified, segment it, tag parts of speech, and remove stop words.
2. Select features and vectorize.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC category feature vector.
4. Compute the cosine similarity S_bj between B_j and each patent in the training set.
5. Sort the training patents by S_bj in descending order and select the first K as the neighbourhood set of B_j.
(8) Make the classification decision
This step includes:
1. Compute the shared-neighbourhood size L(B_j, d_i) between the patent B_j to be classified and each sample patent d_i, i.e. the number of patents the two neighbourhood sets have in common.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as in formula (6).
Here I denotes a class and p, k, α, β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to the class with the largest similarity S(i).
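The decision step can be sketched as a neighbourhood-overlap weighted vote blended with the category-vector similarity. The published formula (6) with parameters p, k, and β is not reproduced in this text, so the blend below (a single α mixing the two scores, and similarity scaled by 1 + overlap) is an illustrative assumption, as are all the sample data.

```python
def shared_field(nbhd_a, nbhd_b):
    """L(B_j, d_i): number of patents the two neighbourhood sets share."""
    return len(set(nbhd_a) & set(nbhd_b))

def classify(cand_nbhd, train_nbhds, train_labels, sims, cat_sims, alpha=0.6):
    """Illustrative decision: per-class vote of cosine similarity scaled by
    neighbourhood overlap, blended with the IPC category-vector similarity."""
    scores = {}
    for pid, label in train_labels.items():
        overlap = shared_field(cand_nbhd, train_nbhds[pid])
        scores[label] = scores.get(label, 0.0) + sims[pid] * (1 + overlap)
    return max(scores,
               key=lambda c: alpha * scores[c] + (1 - alpha) * cat_sims.get(c, 0.0))

train_labels = {"P1": "G06F", "P2": "G06F", "P3": "A01B"}
train_nbhds = {"P1": ["P2"], "P2": ["P1"], "P3": ["P4"]}
sims = {"P1": 0.9, "P2": 0.8, "P3": 0.1}
cat_sims = {"G06F": 0.7, "A01B": 0.2}
print(classify(["P1", "P2"], train_nbhds, train_labels, sims, cat_sims))  # G06F
```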
The main beneficial effects of the invention are:
(1) Patent text feature selection
Compared with the title and abstract, the patent specification is richer and carries far more information. For the same reason, it also contains a large amount of noise; at levels below the IPC group in particular, different patents share much similar material, which hinders classification. The invention therefore improves the feature extraction and feature vector construction for specifications, reducing noise interference and raising classification precision.
(2) Classifier design
Patent data are very large and patent classes very numerous, so classification models train too slowly and are plainly unsuited to patent classification as they stand. The invention therefore proposes a new nearest-neighbour classification algorithm that incorporates IPC description information into the classification process, further improving accuracy while preserving classification speed.
Brief description of the drawings
Fig. 1 is the structural block diagram of the embodiment of the invention.
Fig. 2 is the construction flow of the patent vector space in the embodiment.
Fig. 3 is the classification flowchart based on the improved KNN in the embodiment.
Embodiment
The patent classification method of the invention is described in detail below, taking patent documents as the embodiment; the specific procedure is as follows:
Step 1: Obtain the patent text data and preprocess the specification, mainly by segmentation and stop-word removal.
1. Obtain the IPC class descriptions; segment them, tag parts of speech, and remove stop words; after manually checking the segmentation results, build a user dictionary.
2. Convert the format of each extracted patent sample and extract its specification; add the user dictionary built in (1) to the segmentation program, then segment the specification and tag parts of speech.
3. Use regular expressions to remove from the specification the stop words, function words, conjunctions, and other words of little use for classification, retaining only nouns, adjectives, and verbs.
Step 2: Count each word's term frequency, position information, part-of-speech weight, and inter-class distribution, and build the inverted index file from these statistics and the patent text.
The inverted index is built from the words retained in step 1. The index structure comprises a vocabulary and event tables, one event table per vocabulary entry; each event table stores the numbers of the patents in which the word occurs, together with its term frequency, position weight, and part-of-speech weight. In the position weight formula, n denotes the total number of occurrences of the word in the specification and l_i the weight of the position of its i-th occurrence; for example, technical field positions are weighted 1, the background 0.8, and other positions 0.5. The part-of-speech weights are set to 2.5 for nouns and 1 for verbs and adjectives. The concrete results are shown in Table 1.
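The position weight can be sketched with the stated location weights. The published formula is not reproduced in this text, so the averaging over the word's n occurrences below is an assumption, and the section names are illustrative.

```python
# Location weights stated in the description; averaging is an assumption.
LOC_WEIGHT = {"technical_field": 1.0, "background": 0.8, "other": 0.5}
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}   # noun, verb, adjective

def position_weight(sections):
    """Assumed form P_t = (1/n) * sum of l_i over the word's occurrences."""
    return sum(LOC_WEIGHT[s] for s in sections) / len(sections)

w = position_weight(["technical_field", "background", "other", "other"])
print(round(w, 2))   # 0.7
```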
Table 1. User dictionary merged with the inverted index
Step 3: Compute the feature value of each word with the feature selection method that combines information gain and term frequency, sort by feature value, and select a fixed number of feature words to represent the patent text.
Information gain has a known weakness on low-frequency words, while applicants tend to repeat particular words to emphasize an innovation, and such high-frequency words help classification. The invention therefore combines information gain with term frequency: the feature value of each word in a patent is computed by formula (1), the words are sorted by feature value in descending order, and the top 20 are taken as the patent's feature words.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j; B_ij the number that contain t_i but do not belong to c_j; C_ij the number that do not contain t_i but belong to c_j; and D_ij the number that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF reflects the influence of term frequency on feature selection. Let m be the number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC measures how scattered a feature word is across classes: the more scattered the word, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of t_i in class c_j, TF(t_i) its total frequency, and the overline of TF(t_i) its average frequency over all classes; IC is then computed as in formula (3).
Step 4: Use the inverted index file to compute the weight of each patent feature word with the improved TF-IDF formula, and finally build the patent feature vectors.
This step specifically includes:
1. Weight calculation, as in formula (4).
Here tf_i(t) is the frequency of feature word t in text d_i, N is the number of patents in the whole sample set, n is the number of patents in which t occurs, C_t is the part-of-speech weight coefficient of t, and P_t its position weight coefficient.
2. Sorting: sort by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent.
The term frequency, position weight, and part-of-speech weight of each feature word are already recorded in the inverted index file, so only the number of texts containing the feature word (the document frequency) still needs counting. The concrete results are shown in Table 2.
Table 2. Patent feature vectors
Step 5: Generate the category feature vectors of each IPC level. On the basis of step 1, working upward level by level from the groups, compute the in-class weight of each word at the corresponding level using TF-IDF, treating each class description as one text, and then build the category feature vector of each level.
This step specifically includes:
1. Merge the class descriptions of all subgroups into the description of their main group, then segment and remove stop words.
2. After merging the descriptions of each main group, perform feature selection and construct the main-group-level category feature vectors of the IPC, written {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all descriptions under the same subclass, perform feature selection and construct the subclass-level category feature vectors, written {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all descriptions under the same class, perform feature selection and construct the class-level category feature vectors, written {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, the word lists of all groups under subclass A01B are merged into one A01B word set (and likewise for the other groups under class A01); the weight of each word in the A01B word set is then computed, and finally the feature vector of A01B is constructed.
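The merging in this example can be sketched as follows; the group names and word lists are illustrative, and the resulting counts are the term frequencies from which the TF-IDF weights of the subclass-level vector would be computed.

```python
from collections import Counter

# Illustrative word lists for two main groups under subclass A01B.
groups = {
    "A01B1/00": ["hand", "tool", "soil"],
    "A01B3/00": ["plough", "soil"],
}

# Merge all group word lists into one A01B word set with term frequencies.
merged = Counter(w for words in groups.values() for w in words)
print(merged["soil"])   # 2
```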
Step 6: Build the patent sample neighbourhoods. Using the patent feature vectors from step 4, compute the similarity between each patent and all others, sort the similarities, and select the 100 most similar patents to form the patent's neighbourhood set.
This step specifically includes:
1. Compute the similarity between every pair of patents in the training set, obtained as the cosine of the angle between their vectors. Let sim(d_i, d_j) denote the similarity of patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk are the weights of the corresponding feature word in the two patent vectors and n is the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order and select the first K samples to form the set D_i, called the neighbourhood of patent d_i; the value of K is chosen case by case.
The concrete results are shown in Table 3.
Table 3. Patent neighbourhood sets
Step 7: Compute the cosine similarities between the vector of the patent to be classified and the IPC category feature vectors and the training-set patents, and likewise compute the neighbourhood set of the patent to be classified.
This step includes:
1. Preprocess the patent to be classified: feature selection, vectorization, and data format conversion.
2. Select features and vectorize.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC category feature vector.
4. Compute the cosine similarity S_bj between B_j and each patent in the training set.
5. Sort the training patents by S_bj in descending order and select the first K as the neighbourhood set of B_j.
Step 8: Classification decision. First compute the shared-neighbourhood size between the patent to be classified and the training-set patents, i.e. the number of identical patents in the two neighbourhood sets. Then compute the weighted similarity between the patent and each patent class, sort the weighted sums, and assign the patent to the class with the largest value.
This step specifically includes:
1. Compute the shared-neighbourhood size L(B_j, d_i) between the patent B_j to be classified and each sample patent d_i, i.e. the number of patents the two neighbourhood sets have in common.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as in formula (6).
Here I denotes a class and p, k, α, β are adjustable parameters; by default p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to the class with the largest similarity S(i).
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an illustrative example", "an example", "a specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example, and the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the invention; the scope of the invention is defined by the claims and their equivalents.
Claims (8)
1. A patent classification method based on the specification, characterized by comprising the following steps:
Step 1: obtain the patent text data and preprocess the patent specification;
Step 2: count each word's term frequency, position information, part-of-speech weight, and inter-class distribution, and build the inverted index file from these statistics and the patent text;
Step 3: compute the feature value of each word with the feature selection method that combines information gain and term frequency, sort the feature values, and select a fixed number of feature words to represent the patent text;
Step 4: use the inverted index file to compute the weight of each patent feature word with the improved TF-IDF formula, and finally build the patent feature vectors;
Step 5: generate the category feature vectors of each IPC level: on the basis of step 1, working upward level by level from the groups, compute the in-class weight of each word at the corresponding level using TF-IDF, treating each class description as one text, then build the category feature vector of each level;
Step 6: build the patent sample neighbourhoods: using the patent feature vectors from step 4, compute the similarity between each patent and all others, sort the similarities, and select the K most similar patents to form the patent's neighbourhood set;
Step 7: compute the cosine similarities between the vector of the patent to be classified and the IPC category feature vectors and the training-set patents, and likewise compute the neighbourhood set of the patent to be classified;
Step 8: first compute the shared-neighbourhood size between the patent to be classified and the training-set patents, i.e. the number of identical patents in the neighbourhood sets; then compute the weighted similarity between the patent and each patent class, sort the weighted sums, and assign the patent to the class with the largest value.
2. The patent classification method based on the specification according to claim 1, characterized in that step 1 specifically includes: collecting the patent sample data and their IPC symbols; extracting the specifications; Chinese word segmentation and part-of-speech tagging; removing symbols and numerals from the specification; and filtering out stop words, function words, conjunctions, and other words of little use for classification with regular-expression matching, retaining only content words such as nouns, adjectives, and verbs.
3. The patent classification method based on the specification according to claim 1, characterized in that the feature value in step 3 is computed as follows:
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j; B_ij the number that contain t_i but do not belong to c_j; C_ij the number that do not contain t_i but belong to c_j; and D_ij the number that neither contain t_i nor belong to c_j; the feature value is then computed as in formula (1):
$$G(t_i)=TF\times IC\times\left(\frac{A_{ij}+B_{ij}}{N}\left[\sum_{j=1}^{m}\frac{A_{ij}}{A_{ij}+B_{ij}}\log\frac{A_{ij}}{A_{ij}+B_{ij}}\right]+\frac{C_{ij}+D_{ij}}{N}\left[\sum_{j=1}^{m}\frac{C_{ij}}{C_{ij}+D_{ij}}\log\frac{C_{ij}}{C_{ij}+D_{ij}}\right]\right)\qquad(1)$$
where TF represents the degree to which a word's frequency within a patent influences its selection as a patent feature. Let m be the total number of categories in the training patents, N_j the number of patents in category c_j, and TF_jk the frequency of feature word t_i in patent P_k of category c_j. TF is then calculated as in formula (2):
$$TF = \max_{1 \le j \le m}\left\{ \sqrt{\frac{\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}{\sum_{j=1}^{m}\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}} \right\} \qquad (2)$$
In formula (1), IC represents the degree of dispersion of a feature word across categories; the more dispersed the word, the less representative it is and the smaller its value. Let TF_j(t_i) denote the frequency of feature word t_i in category c_j, TF(t_i) its total frequency, and $\overline{TF(t_i)}$ the average frequency of t_i over all categories. IC is then calculated as in formula (3):
$$IC = \frac{\sqrt{\sum_{j=1}^{m}\left(TF_j(t_i) - \overline{TF(t_i)}\right)^2}}{TF(t_i)} \qquad (3)$$
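Formulas (1)–(3) can be sketched together as follows. The input counts are invented for illustration, and the index j of the coefficients (A_ij+B_ij)/N in formula (1), which appears outside the sums, is ambiguous in the claim; naming it `j0` below is one reading of that notation:

```python
import math

def feature_value(tf_jk, A, B, C, D, N, j0=0):
    """Sketch of formulas (1)-(3) for one feature word t_i.

    tf_jk[j][k]: frequency of t_i in patent k of class j.
    A, B, C, D: per-class document counts as defined in claim 3.
    N: total number of documents; j0: class used for the outer coefficients.
    """
    m = len(tf_jk)
    # Formula (2): class whose share of squared term frequencies is largest.
    total_sq = sum(f * f for cls in tf_jk for f in cls)
    TF = max(math.sqrt(sum(f * f for f in cls) / total_sq) for cls in tf_jk)
    # Formula (3): dispersion of per-class frequencies around their mean,
    # normalized by the total frequency.
    tf_j = [sum(cls) for cls in tf_jk]
    mean = sum(tf_j) / m
    IC = math.sqrt(sum((x - mean) ** 2 for x in tf_j)) / sum(tf_j)
    # Formula (1): entropy-style sums over (A, B) and (C, D).
    def entropy_term(X, Y):
        s = sum((X[j] / (X[j] + Y[j])) * math.log(X[j] / (X[j] + Y[j]))
                for j in range(m) if X[j] > 0)
        return (X[j0] + Y[j0]) / N * s
    return TF * IC * (entropy_term(A, B) + entropy_term(C, D))

g = feature_value([[2, 1], [0, 1]], A=[2, 1], B=[1, 2], C=[1, 2], D=[2, 1], N=6)
```

Because the log terms operate on fractions below one, both entropy sums are negative, so G(t_i) is non-positive for this data.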
4. The specification-based patent classification method according to claim 1, characterized in that step 4 specifically comprises:
Step 4.1, weight calculation, as shown in formula (4), where the weight depends on the frequency of feature word t in the text; N denotes the number of patents in the whole patent sample set, n the number of patents in the sample set that contain feature word t, C_t the part-of-speech weight coefficient corresponding to the feature word's part of speech, and P_t the position weight coefficient of the feature word;
Step 4.2, sorting: sort the feature words by weight in descending order and construct the vector-space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
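Formula (4) itself is not reproduced in this excerpt, so the sketch below assumes a standard TF-IDF form scaled by the part-of-speech weight C_t and position weight P_t named above; the sample statistics are invented:

```python
import math

def term_weight(tf, N, n, Ct, Pt):
    """Assumed form of formula (4): TF-IDF scaled by POS and position weights."""
    return tf * math.log(N / n) * Ct * Pt

def build_vector(term_stats):
    """Step 4.2: sort feature words by weight, descending, to form V_i."""
    weights = {t: term_weight(*s) for t, s in term_stats.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# (tf, N, n, Ct, Pt) per feature word; values are illustrative.
stats = {"转子": (5, 1000, 40, 1.2, 1.5), "装置": (9, 1000, 600, 1.0, 1.0)}
print(build_vector(stats))
```

A rare, noun-tagged term in a prominent position ("转子") outranks a common generic term ("装置") even at lower raw frequency.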
5. The specification-based patent classification method according to claim 1, characterized in that step 5 specifically comprises:
Step 5.1, merge the category description of each subgroup into the category description of its main group, then perform word segmentation and stop-word removal;
Step 5.2, after merging the descriptions of each main group, perform feature selection and construct the category feature vectors of the IPC main-group level, represented as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 is the last;
Step 5.3, after merging all descriptions under the same subclass, perform feature selection and construct the category feature vectors of the IPC subclass level, represented as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z is the last;
Step 5.4, after merging all descriptions under the same class, perform feature selection and construct the category feature vectors of the IPC class level, represented as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 is the last.
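The merging in steps 5.2–5.4 amounts to grouping IPC codes by progressively shorter code prefixes (main group → subclass → class). A minimal sketch, with invented keyword lists standing in for the category descriptions:

```python
from collections import defaultdict

# Hypothetical main-group descriptions, already segmented into keywords.
main_groups = {
    "A01B1/00": ["手动", "工具", "锹"],
    "A01B3/00": ["犁", "耕作"],
    "H99Z99/00": ["其他", "技术"],
}

def merge_level(descriptions, key_len):
    """Merge keyword lists whose IPC codes share the first key_len characters."""
    merged = defaultdict(list)
    for code, words in descriptions.items():
        merged[code[:key_len]].extend(words)
    return dict(merged)

subclass_level = merge_level(main_groups, 4)  # e.g. A01B
class_level = merge_level(main_groups, 3)     # e.g. A01
print(sorted(subclass_level))  # ['A01B', 'H99Z']
print(sorted(class_level))     # ['A01', 'H99']
```

Feature selection would then be applied to each merged keyword list to obtain the category feature vector for that level.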
6. The specification-based patent classification method according to claim 1, characterized in that step 6 specifically comprises:
Step 6.1, compute the similarity between every pair of patents in the patent training set; the similarity can be obtained from the cosine of the angle between their vectors. Let sim(d_i, d_j) denote the similarity of patent texts d_i and d_j; it is calculated as in formula (5):
$$sim(d_i, d_j) = \frac{\sum_{k=1}^{n} W_{ik} \times W_{jk}}{\sqrt{\sum_{k=1}^{n} W_{ik}^2 \times \sum_{k=1}^{n} W_{jk}^2}} \qquad (5)$$
where W_ik and W_jk represent the weights of the corresponding feature words in the patent vectors and n is the dimension of the vectors;
Step 6.2, sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined case by case.
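Steps 6.1 and 6.2 can be sketched directly from formula (5); the weight vectors below are invented:

```python
import math

def cosine(u, v):
    """Formula (5): cosine of the angle between two weight vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return num / den

def neighborhood(i, vectors, K):
    """Step 6.2: indices of the K training patents most similar to patent i."""
    sims = [(j, cosine(vectors[i], vj)) for j, vj in enumerate(vectors) if j != i]
    sims.sort(key=lambda p: p[1], reverse=True)
    return [j for j, _ in sims[:K]]

vecs = [[1.0, 0.0, 1.0], [0.9, 0.1, 1.1], [0.0, 1.0, 0.0]]
print(neighborhood(0, vecs, 1))  # [1]
```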
7. The specification-based patent classification method according to claim 1, characterized in that step 7 specifically comprises:
Step 7.1, extract the specification of the patent to be classified, perform Chinese word segmentation and part-of-speech tagging, and remove stop words;
Step 7.2, perform patent feature selection and vectorization;
Step 7.3, compute the cosine similarity S_ai between the feature vector of the patent to be classified, B_j, and each IPC category feature vector;
Step 7.4, compute the cosine similarity S_bj between the patent to be classified, B_j, and each patent in the patent training set;
Step 7.5, sort the training patents by similarity S_bj in descending order and select the top K patents as its neighborhood set.
8. The specification-based patent classification method according to claim 1, characterized in that step 8 specifically comprises:
Step 8.1, compute the size L(B_j, d_i) of the shared field between the patent to be classified, B_j, and sample patent d_i, i.e., the number of identical patents in the two field sets;
Step 8.2, compute the final weighted similarity between the patent to be classified and each IPC category, as in formula (6):
$$S(i) = S_{ai} + p \times \sum_{B_j \in I} k^{I.length} \times \alpha^{\log\left(\mathrm{abs}\left(L(B_j, d_i) - \beta\right) + 0.1\right)} \times S_{bj} \qquad (6)$$
where I represents the category and p, k, α, β are adjustable parameters; by system default, p = 0.8, k = 0.95, α = 0.6, and β = 5;
Step 8.3, assign the patent to be classified to the category with the largest similarity S(i).
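A sketch of formula (6) with the default parameters p = 0.8, k = 0.95, α = 0.6, β = 5; the neighbor list, shared-field sizes, and similarities below are invented:

```python
import math

def weighted_similarity(S_ai, neighbors, I_length,
                        p=0.8, k=0.95, alpha=0.6, beta=5):
    """Sketch of formula (6).

    neighbors: list of (L, S_bj) pairs, where L = L(B_j, d_i) is the
    shared-field size and S_bj the cosine similarity to that training sample.
    I_length: length attribute of category I used in the k^I.length factor.
    """
    total = 0.0
    for L, S_bj in neighbors:
        total += (k ** I_length) * (alpha ** math.log(abs(L - beta) + 0.1)) * S_bj
    return S_ai + p * total

S = weighted_similarity(S_ai=0.4, neighbors=[(7, 0.8), (3, 0.5)], I_length=4)
print(round(S, 4))
```

Each neighbor's contribution is damped by the category depth (k^I.length) and by how far its shared-field size sits from β, then added to the direct category similarity S_ai.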
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710082677.8A CN107122382B (en) | 2017-02-16 | 2017-02-16 | Patent classification method based on specification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107122382A true CN107122382A (en) | 2017-09-01 |
CN107122382B CN107122382B (en) | 2021-03-23 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |