CN111708865B - Technology forecasting and patent early warning analysis method based on improved XGboost algorithm - Google Patents

Technology forecasting and patent early warning analysis method based on improved XGBoost algorithm

Info

Publication number
CN111708865B
Authority
CN
China
Prior art keywords
xgboost
early warning
algorithm
early
technical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010557407.XA
Other languages
Chinese (zh)
Other versions
CN111708865A (en
Inventor
黄梦醒
李茂�
冯思玲
冯文龙
张雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hainan University
Original Assignee
Hainan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hainan University
Priority to CN202010557407.XA
Publication of CN111708865A
Application granted
Publication of CN111708865B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services
    • G06Q50/184 Intellectual property management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Technology Law (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Tourism & Hospitality (AREA)
  • Molecular Biology (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. A user inputs a patent search formula, and a patent subject database is constructed from it. A convolutional neural network extracts features from the patent texts in the database to obtain feature vectors, and a test set is constructed from the feature vectors. An XGBoost model is trained on the training set and then improved with the grey wolf optimization algorithm, which raises classification precision and efficiency; testing the improved model on the test set yields the XGBoost classifier. When a patent to be early-warned is input into the classifier, its class and the other patent texts in the same class are obtained, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out. The user receives a highly accurate, visualized prediction result and can grasp the current state of the technology and its future evolution direction at a glance.

Description

Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
Technical Field
The invention relates to the technical field of patent information processing, in particular to a technical forecasting and patent early warning analysis method based on an improved XGboost algorithm.
Background
With the rapid development of science and technology, new technologies emerge continually, intellectual property receives growing attention, and the competitive environment of the market becomes ever more complex. Staying ahead in intense technical competition is essential to an enterprise's competitiveness, and patents are increasingly a core element of that competitiveness. By analyzing existing patents, an enterprise can carry out technology forecasting and patent early warning, thereby avoiding patent traps and keeping track of how a technology is likely to develop.
CN106897392A discloses a technology competition and patent early warning analysis method based on knowledge discovery. After a user inputs a search formula, a special-topic database is established; a cluster data set of that database is obtained using data mining and knowledge discovery tools such as vector space models and mathematical statistics; patent early warning and patent theme life cycle analysis are then performed on the cluster data set, and a visual result is provided to the user, thereby realizing technology competition analysis and patent early warning.
CN106845798A discloses a cross-domain patent early warning information analysis method based on a multi-branch tree. Its core idea is to screen, classify and extract features from the collected patent data, build a multi-branch tree for the patent domain in which each leaf node stores a patent technology and the associated user information, search the tree for leaf nodes matching the patent to be early-warned, and perform cross-domain patent early warning according to the matching results. Its drawbacks are that a large amount of patent technology information and associated user information must be collected, which consumes considerable time and labor; the collected information is not necessarily reliable, so the content stored in the leaf nodes may be inaccurate; and errors are then prone to occur in searching and matching, which greatly affects the accuracy of the patent early warning.
A big data patent early warning service system based on a genetic algorithm with the publication number of CN107369007A adopts the data mining algorithm based on the genetic algorithm in a data mining module to classify a patent data set, and then analyzes a classification result to realize patent early warning.
Disclosure of Invention
Therefore, the invention provides a technical forecasting and patent early warning analysis method based on an improved XGboost algorithm, which is used for optimizing parameters by using the improved XGboost algorithm and classifying patent subject databases to improve the classification precision and the classification efficiency.
The technical scheme of the invention is realized as follows:
A technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm comprises the following steps:
Step S1, constructing a patent subject database according to the patent search formula;
Step S2, extracting features from the patent texts in the patent subject database with a convolutional neural network to obtain feature vectors V_k;
Step S3, constructing a test set S from the feature vectors V_k;
Step S4, inputting the training set into an XGBoost model for training, optimizing the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model with the grey wolf optimization algorithm, and inputting the test set into the improved XGBoost model for testing to obtain the XGBoost classifier;
Step S5, extracting the features of the patent to be early-warned and inputting them into the XGBoost classifier to obtain the class of the patent to be early-warned and the other patent texts in the same class;
Step S6, carrying out patent early warning analysis and prediction of technology maturity and technology evolution direction according to the class of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis prediction result;
Step S7, visually displaying the analysis prediction result and sending it to the user.
Preferably, step S1 is preceded by:
Step S0, setting a patent early warning threshold and an early warning result receiving mode.
Preferably, step S1 specifically comprises: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent search formula, constructing the patent subject database, and performing text denoising preprocessing at the same time.
Preferably, step S2 specifically comprises: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text F_k in the patent subject database forms a patent text representation matrix M_k after passing through the input layer, and the matrix M_k is expressed as a feature vector V_k through the operations of the convolutional layer and the pooling layer.
Preferably, the test set of step S3 is S = {s_k = (V_k, L_k)}, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and L_k is the patent label classification number corresponding to sample s_k.
Preferably, the step S4 includes:
Step S41, setting the population size N and the maximum number of iterations T of the grey wolf optimization algorithm, setting the value ranges of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model, and initializing the other parameters of the XGBoost model;
Step S42, randomly generating a grey wolf pack, wherein the position of each grey wolf individual consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;
Step S43, the XGBoost model learns the training set with the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function F_new;
Step S44, dividing the grey wolf pack into four ranks α, β, δ and ω according to the fitness function values;
Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value F_new of each grey wolf individual at its new position, and comparing it with the optimal fitness function value F_g of the previous iteration: if F_new > F_g, the individual's fitness function value is set to F_new and its new position is retained; otherwise the individual's fitness function value remains F_g;
Step S46, repeating steps S42 to S45, stopping when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model;
Step S47, inputting the test set S into the XGBoost model for testing to obtain the trained XGBoost classifier.
Preferably, the fitness function F_new of step S43 is expressed as
F_{new} = (F_{Precision} + F_{Recall} + F_1)/3;
where F_{Precision} is the precision, expressed as:
F_{Precision} = \frac{TP}{TP + FP}
F_{Recall} is the recall, expressed as:
F_{Recall} = \frac{TP}{TP + FN}
F_1 is an index measuring classification accuracy, expressed as:
F_1 = \frac{2 \cdot F_{Precision} \cdot F_{Recall}}{F_{Precision} + F_{Recall}}
where TP, FP and FN are the numbers of true positives, false positives and false negatives obtained by comparing the true categories with the predicted categories of the patent texts.
Preferably, the patent early warning analysis in step S6 specifically comprises: calculating the similarity between the patent to be early-warned and the other patents in the same category with the SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.
Preferably, the specific steps of predicting the technology maturity and the technology evolution direction in step S6 are as follows: drawing a patent characteristic fitting curve according to a TRIZ theory, comparing the fitting curve with a standard S curve, and meanwhile, predicting the patent technology maturity by combining a patent data prediction algorithm; and analyzing the technical evolution process in detail by using a technical evolution radar map, and displaying a visualization result to predict the technical evolution direction.
Preferably, the analysis prediction result in step S7 is sent to the user in the warning result receiving manner set in step S0.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a technology forecasting and patent early warning analysis method based on an improved XGBoost algorithm. Features of the patent texts in the patent subject database are extracted with a convolutional neural network and a test set is constructed. The XGBoost model is trained on the training set and improved with the grey wolf optimization algorithm to obtain its optimal parameters, and is then tested on the test set to obtain the XGBoost classifier, which ensures the classification accuracy of the classifier. The patent to be early-warned is classified by the XGBoost classifier, and its class, together with the other patent texts in the same class, is used for patent early warning analysis and for predicting technology maturity and technology evolution direction, finally yielding an analysis prediction result. Improving the XGBoost model with the grey wolf optimization algorithm raises the classification accuracy, reduces the number of operations and improves the classification efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only preferred embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.
FIG. 1 is a flow chart of a technical anticipation and patent early warning analysis method based on an improved XGboost algorithm according to the present invention;
FIG. 2 is a diagram of a technical evolution radar chart based on the improved XGboost algorithm technical forecast and patent early warning analysis method of the invention;
fig. 3 is a technical evolution analysis process diagram of a technical forecast and patent early warning analysis method based on an improved XGBoost algorithm.
Detailed Description
For a better understanding of the technical content of the present invention, a specific embodiment is provided below, and the present invention is further described with reference to the accompanying drawings.
Referring to fig. 1, the technical anticipation and patent early warning analysis method based on the improved XGBoost algorithm provided by the present invention includes the following steps:
Step S1, constructing a patent subject database according to the patent search formula;
Step S2, extracting features from the patent texts in the patent subject database with a convolutional neural network to obtain feature vectors V_k;
Step S3, constructing a test set S from the feature vectors V_k;
Step S4, inputting the training set into an XGBoost model for training, optimizing the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model with the grey wolf optimization algorithm, and inputting the test set into the improved XGBoost model for testing to obtain the XGBoost classifier;
Step S5, extracting the features of the patent to be early-warned and inputting them into the XGBoost classifier to obtain the class of the patent to be early-warned and the other patent texts in the same class;
Step S6, carrying out patent early warning analysis and prediction of technology maturity and technology evolution direction according to the class of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis prediction result;
Step S7, visually displaying the analysis prediction result and sending it to the user.
In the method of the invention, the user first inputs a patent search formula on the interface as prompted, and a patent subject database is constructed from it. A convolutional neural network then extracts features from the patent texts in the database, and the resulting feature vectors are used to construct the test set S. The XGBoost model is trained on the training set and improved with the grey wolf optimization algorithm, and is finally tested on the test set prepared in advance so that its parameters reach an optimal state; this yields the trained XGBoost classifier. The patent to be early-warned is then input into the classifier, which outputs its class and the other patent texts in the same class, so that patent early warning analysis and prediction of technology maturity and technology evolution direction can be carried out and the final analysis prediction result displayed visually. Extracting features with a convolutional neural network avoids complex data preprocessing and does not depend on external knowledge, which makes the method user-friendly. Optimizing the XGBoost model with the grey wolf optimization algorithm improves classification precision and efficiency and reduces time complexity. The user only needs to input a patent search formula to obtain the patent early warning analysis and the technology maturity and evolution direction predictions for the patent to be early-warned.
Preferably, step S1 is preceded by:
Step S0, setting a patent early warning threshold and an early warning result receiving mode.
After inputting the patent search formula on the interface, the user also sets the patent early warning threshold and the early warning result receiving mode. The patent early warning process of the invention uses the SimHash algorithm, and empirically, for 64-bit SimHash values, two texts whose Hamming distance is within 3 can be considered highly similar; the patent early warning threshold is therefore set to 3 by default, but the user can choose another threshold according to the specific use. Various receiving modes are possible, such as receiving by mail.
Preferably, step S1 specifically comprises: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial fields according to the patent search formula, constructing the patent subject database, and performing text denoising preprocessing at the same time.
Preferably, step S2 specifically comprises: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text F_k in the patent subject database forms a patent text representation matrix M_k after passing through the input layer, and the matrix M_k is expressed as a feature vector V_k through the operations of the convolutional layer and the pooling layer.
The invention combines word vectors with a convolutional neural network to extract features from the patent texts of the constructed patent subject database and form feature vectors. The convolutional neural network comprises a convolutional layer, a pooling layer, a fully connected layer and an output layer; neurons in adjacent layers are connected to one another, while neurons within the same layer are not.
① Input layer: each word w_i in the patent text is converted into a vector v_i through a word vector dictionary. Word vectors make up for the weakness of the BOW and TF-IDF models in expressing the grammar, order and semantic relations of words. For a patent document F_k, the patent text representation matrix is formed by joining the vectors, denoted M_k = {v_1, v_2, ..., v_n}.
② Convolutional layer: the convolution kernel slides continuously over the rows of the matrix M_k to extract local features. The width b of the convolution kernel equals the dimension of the word vector v_i, and the height h of the convolution kernel is the range of local text features to be extracted; a good feature extraction effect is obtained when h is between 2 and 5. n convolution kernels slide over the matrix M_k and perform the convolution operation.
Let M_k[i:j] denote rows i to j of the patent text matrix and D_i the i-th convolution kernel; the output of the convolution kernel can then be expressed as
r_i = M_k[i:i+h-1] · D_i
C_i = f(r_i + b)
where · is the dot product operation, C_i is the feature learned by the i-th convolution kernel, b is a bias variable, and f is a nonlinear activation function. ReLU is selected as f because, compared with activation functions such as Sigmoid, it has lower computational complexity, converges faster and does not suffer from gradient saturation.
③ Pooling layer: all local features C_i obtained by the convolutional layer are aggregated by the max pooling function, which acts on each captured feature C_i to reduce dimensionality and keep the feature with the highest value. The max pooling function is expressed as:
W_i = pooling_max(C_i);
where W_i is the maximum obtained by applying the max pooling function to the local feature C_i. The n feature values generated by the n convolution kernels can be denoted V_k = {W_1, W_2, ..., W_n}.
④ Output layer: the feature vector V_k obtained from the pooling layer is output.
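The convolution and max pooling steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the word vectors and kernel weights are random placeholders rather than trained values, and the names `conv_feature_map` and `extract_features` are hypothetical, not taken from the patent.

```python
import random

def relu(x):
    """ReLU activation, f(x) = max(0, x)."""
    return x if x > 0.0 else 0.0

def conv_feature_map(M, kernel, bias):
    """Slide one h-row kernel down the rows of M and return the feature map C."""
    h, width = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(len(M) - h + 1):
        # r = M[i : i+h-1] . D  (elementwise product summed over the window)
        r = sum(M[i + a][c] * kernel[a][c] for a in range(h) for c in range(width))
        feature_map.append(relu(r + bias))
    return feature_map

def extract_features(M, kernels, biases):
    """Max-pool each feature map to a single value W_i; V = [W_1, ..., W_n]."""
    return [max(conv_feature_map(M, D, b)) for D, b in zip(kernels, biases)]

# Toy input: 6 words with 4-dimensional word vectors (random placeholders).
rng = random.Random(0)
M_k = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
# Three kernels of heights 2, 3 and 4, each as wide as a word vector.
kernels = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(h)] for h in (2, 3, 4)]
biases = [0.0, 0.0, 0.0]
V_k = extract_features(M_k, kernels, biases)
print(V_k)
```

Each kernel contributes exactly one entry of V_k, so the length of the feature vector equals the number of kernels regardless of the length of the patent text.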
Preferably, the test set of step S3 is S = {s_k = (V_k, L_k)}, where S is the test set consisting of the feature vectors and labels of all documents in the patent subject database, and L_k is the patent label classification number corresponding to sample s_k.
Preferably, the step S4 includes:
Step S41, setting the population size N and the maximum number of iterations T of the grey wolf optimization algorithm, setting the value ranges of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model, and initializing the other parameters of the XGBoost model;
Step S42, randomly generating a grey wolf pack, wherein the position of each grey wolf individual consists of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight;
Step S43, the XGBoost model learns the training set with the initial number of base classifiers, learning rate, maximum tree depth and minimum leaf node sample weight, and the fitness function value of each grey wolf is calculated according to the fitness function F_new;
Step S44, dividing the grey wolf pack into four ranks α, β, δ and ω according to the fitness function values;
Step S45, updating the position of each individual in the grey wolf pack, recalculating the fitness function value F_new of each grey wolf individual at its new position, and comparing it with the optimal fitness function value F_g of the previous iteration: if F_new > F_g, the individual's fitness function value is set to F_new and its new position is retained; otherwise the individual's fitness function value remains F_g;
Step S46, repeating steps S42 to S45, stopping when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum tree depth and the minimum leaf node sample weight of the XGBoost model;
Step S47, inputting the test set S into the XGBoost model for testing to obtain the trained XGBoost classifier.
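Steps S41 to S46 can be sketched as a small grey wolf optimizer. The patent does not spell out the position update, so the sketch assumes the commonly published form (coefficients A and C, with a falling linearly from 2 towards 0), and it reads the comparison in step S45 as "a wolf keeps its new position only if that position is fitter than its previous one". The function name `grey_wolf_optimize` and the toy fitness function are hypothetical; in the patent's setting the position would hold the four XGBoost parameters and the fitness would come from training and scoring the model.

```python
import random

def grey_wolf_optimize(fitness, bounds, n_wolves=20, max_iter=50, seed=0):
    """Maximize `fitness` over the box `bounds` with a basic grey wolf optimizer."""
    rng = random.Random(seed)
    # S42: random initial pack; one position = one parameter combination.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    # S43: fitness of every wolf.
    scores = [fitness(w) for w in wolves]
    for t in range(max_iter):
        # S44: the three fittest wolves are alpha, beta, delta; the rest are omega.
        ranked = sorted(range(n_wolves), key=lambda i: scores[i], reverse=True)
        leaders = [list(wolves[i]) for i in ranked[:3]]
        a = 2.0 * (1.0 - t / max_iter)  # assumed schedule: a falls from 2 towards 0
        for i in range(n_wolves):
            # S45: move towards the leaders (assumed standard GWO update).
            candidate = []
            for d, (lo, hi) in enumerate(bounds):
                pulls = []
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    pulls.append(leader[d] - A * abs(C * leader[d] - wolves[i][d]))
                candidate.append(min(hi, max(lo, sum(pulls) / 3.0)))
            f_new = fitness(candidate)
            if f_new > scores[i]:  # keep the new position only if it is fitter
                wolves[i], scores[i] = candidate, f_new
    best = max(range(n_wolves), key=lambda i: scores[i])
    return wolves[best], scores[best]

def toy_fitness(position):
    """Stand-in for the model score; its maximum 0 is at (3, -1)."""
    x, y = position
    return -((x - 3.0) ** 2 + (y + 1.0) ** 2)

best_position, best_score = grey_wolf_optimize(toy_fitness, [(-10.0, 10.0), (-10.0, 10.0)])
print(best_position, best_score)
```

To tune XGBoost this way, `bounds` would list the value ranges from step S41, and the fitness function would round the integer-valued entries (number of base classifiers, maximum depth), train a model with those settings and return F_new.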
Preferably, the fitness function F_new of step S43 is expressed as
F_{new} = (F_{Precision} + F_{Recall} + F_1)/3;
where F_{Precision} is the precision, expressed as:
F_{Precision} = \frac{TP}{TP + FP}
F_{Recall} is the recall, expressed as:
F_{Recall} = \frac{TP}{TP + FN}
F_1 is an index measuring classification accuracy, expressed as:
F_1 = \frac{2 \cdot F_{Precision} \cdot F_{Recall}}{F_{Precision} + F_{Recall}}
where TP, FP and FN are the numbers of true positives, false positives and false negatives obtained by comparing the true categories with the predicted categories of the patent texts.
The XGBoost model is optimized with the grey wolf optimization algorithm. Because the XGBoost model has many parameters, the four parameters with the greatest influence on the model are selected for iterative optimization and the other parameters are set to their default values, in order to improve the classification precision and efficiency of the model. The classification accuracy of the model is evaluated with three indexes, precision F_Precision, recall F_Recall and F_1, and the macro average is taken as the fitness function; the macro average is the mean of the precision F_Precision, recall F_Recall and F_1 values over all classes and evaluates the overall performance of the patent text classification.
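A minimal sketch of the macro-averaged fitness function follows. It computes precision, recall and F_1 per class, averages each over the classes and returns their mean as F_new. The function name `macro_fitness` is hypothetical, and treating an undefined ratio (zero denominator) as 0 is an assumption the source does not address.

```python
def macro_fitness(y_true, y_pred):
    """F_new = (macro precision + macro recall + macro F1) / 3."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        # Assumption: a ratio with a zero denominator counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return (sum(precisions) / n + sum(recalls) / n + sum(f1s) / n) / 3.0

# Two classes, one of four samples misclassified.
print(macro_fitness([0, 0, 1, 1], [0, 1, 1, 1]))
```

In this example the macro precision is 5/6, the macro recall is 3/4 and the macro F_1 is 11/15, so the fitness is about 0.772.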
The improved XGboost model comprises the following specific implementation steps:
a. Initialize the weak learner:
f_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
In the case of square loss,
f_0(x) = \frac{1}{n} \sum_{i=1}^{n} y_i
b. Iteratively generate M base learners; for m = 1, 2, ..., M:
1) For each sample i = 1, 2, ..., n, calculate the negative gradient, i.e. the residual:
r_{im} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}
2) Take the residual obtained in the previous step as the new true value of the sample and use the data (x_i, r_{im}), i = 1, 2, ..., n, as training data for the next tree, obtaining a new regression tree f_m(x) whose leaf node regions are R_{jm}, j = 1, 2, ..., J, where J is the number of leaf nodes of the regression tree.
3) For each leaf region R_{jm}, j = 1, 2, ..., J, calculate the best fit value by differentiating with respect to γ and setting the derivative to 0:
\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)
4) Update the strong learner:
f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})
c. Obtain the final learner:
f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{jm} I(x \in R_{jm})
d. Use the final learner to score each patent document and obtain its classification result.
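The residual-fitting loop in steps a to c can be illustrated with a toy gradient boosting regressor under square loss. This is a sketch under assumptions, not the XGBoost implementation: each base learner is a one-split tree (two leaf regions) on a single feature, a `learning_rate` shrinkage factor is added to the update (a value of 1.0 gives the plain update), and the names `fit_stump` and `gradient_boost` are hypothetical. XGBoost itself additionally uses second-order gradient information and regularization, which are not shown here.

```python
def fit_stump(xs, residuals):
    """Best one-split regression tree under square loss (assumes >= 2 distinct xs)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        # Under square loss the best leaf value is the mean residual in the leaf.
        g_left, g_right = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - g_left) ** 2 for r in left)
               + sum((r - g_right) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, g_left, g_right)
    _, threshold, g_left, g_right = best
    return lambda x: g_left if x <= threshold else g_right

def gradient_boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit f_M(x) = f_0 + sum of shrunken trees, each trained on the residuals."""
    f0 = sum(ys) / len(ys)  # step a: with square loss, f_0 is the mean label
    trees, preds = [], [f0] * len(xs)
    for _ in range(n_trees):  # step b
        residuals = [y - p for y, p in zip(ys, preds)]  # 1) negative gradient
        tree = fit_stump(xs, residuals)                 # 2) and 3) fit the leaves
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]  # 4)
    return lambda x: f0 + learning_rate * sum(t(x) for t in trees)  # step c

model = gradient_boost([1, 2, 3, 4, 5, 6], [1, 1, 1, 5, 5, 5])
print(model(2), model(5))
```

On this step-shaped toy data every round halves the remaining residual, so after 20 rounds the predictions are within about 2e-6 of the labels 1 and 5.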
Preferably, the patent early warning analysis in step S6 specifically comprises: calculating the similarity between the patent to be early-warned and the other patents in the same category with the SimHash algorithm, and outputting the patents whose similarity exceeds the patent early warning threshold.
The invention uses the SimHash algorithm for patent early warning analysis: it calculates the similarity between the patent to be early-warned and the other patents in the same category and outputs the patents whose similarity exceeds the early warning threshold, thereby realizing patent early warning. The main idea of SimHash is dimension reduction: a high-dimensional feature vector is mapped to a low-dimensional one, and whether two articles are duplicates or highly similar is determined by the Hamming distance between the two vectors. The SimHash algorithm of the invention has 4 steps: word segmentation and weight calculation, hash calculation, weighting and merging, and dimension reduction output.
The first step: word segmentation and weight calculation. The text is segmented into words and the weight of each word in the text is calculated; for an overly long text, only the top k words are selected, which yields k keyword-weight pairs.
The second step: hash calculation. The hash value of each keyword is calculated with a hash function; the hash value is an n-bit signature consisting of the binary digits 0 and 1, so each keyword-weight pair is converted into a hash-weight pair.
The third step: weighted merging. The weight of each keyword obtained in the first step is applied bitwise to the hash value of that keyword, i.e. W = hash × weight: where a bit is 1 the weight is taken as positive, and where a bit is 0 the weight is taken as negative. If the global features of the text are to be analyzed, the weighted values of all keywords of the text are merged by accumulating them position by position.
The fourth step: dimension-reduction output. The weighted result of the third step already constitutes a feature code of the text, and the purpose of dimension reduction is to reduce the complexity of this feature code. Each position of the feature code is examined: a value greater than or equal to 0 is set to 1, and a value less than 0 is set to -1, which gives the SimHash value of the text. Finally, whether the similarity between texts exceeds the early-warning threshold is judged from the Hamming distance between their SimHash values.
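The four SimHash steps can be illustrated with a short Python sketch. Several choices here are assumptions made only for the example: whitespace tokenisation (Chinese patent text would need a proper word segmenter), term frequency as the keyword weight, MD5 truncated to `n_bits` as the hash function, and similarity defined as 1 minus the normalised Hamming distance. The signature is packed into an integer, so the -1 entries of the fourth step are stored as 0 bits, which leaves the Hamming distance unchanged.

```python
import hashlib
from collections import Counter


def simhash(text, n_bits=64, top_k=50):
    """Return an n_bits SimHash signature (n_bits: multiple of 8, at most 128)."""
    # Step 1: segment into words and weight them; keep only the top k words.
    keyword_weights = Counter(text.lower().split()).most_common(top_k)
    accumulator = [0.0] * n_bits
    for word, weight in keyword_weights:
        # Step 2: n-bit hash of the keyword.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        word_hash = int.from_bytes(digest[:n_bits // 8], "big")
        # Step 3: add the weight where the bit is 1, subtract it where it is 0.
        for bit in range(n_bits):
            accumulator[bit] += weight if (word_hash >> bit) & 1 else -weight
    # Step 4: reduce each accumulated value to a single bit by its sign.
    signature = 0
    for bit in range(n_bits):
        if accumulator[bit] >= 0:
            signature |= 1 << bit
    return signature


def hamming_distance(sig_a, sig_b):
    """Number of bit positions in which two signatures differ."""
    return bin(sig_a ^ sig_b).count("1")


def similarity(text_a, text_b, n_bits=64):
    """Assumed similarity measure: 1 minus the normalised Hamming distance."""
    distance = hamming_distance(simhash(text_a, n_bits), simhash(text_b, n_bits))
    return 1.0 - distance / n_bits


def early_warning(target, candidates, threshold, n_bits=64):
    """Return the candidate texts whose similarity to target exceeds threshold."""
    return [c for c in candidates if similarity(target, c, n_bits) > threshold]
```

Calling `early_warning(target_text, same_category_texts, threshold)` then returns the texts that would be reported to the user; the threshold value itself is whatever was configured in step S0.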
Preferably, the technology maturity and the technology evolution direction in step S6 are predicted as follows: a patent feature fitting curve is drawn according to TRIZ theory and compared with the standard S-curve, and the maturity of the patented technology is predicted in combination with a patent data prediction algorithm; the technology evolution process is analyzed in detail with a technology evolution radar map, and the visualized result is displayed to predict the technology evolution direction.
The technology maturity prediction of the invention is based on TRIZ theory: a patent feature fitting curve is drawn according to TRIZ theory and compared with the standard S-curve, and the maturity of the patented technology is predicted in combination with a patent data measurement algorithm. TRIZ theory holds that a technology evolves through 4 stages: infancy, growth, maturity and decline. The maturity prediction mainly examines 4 indexes: performance parameters, patent level, patent quantity and economic benefit. To analyze the patent documents of a given label, the number and level of the patents are first counted and their curves over time are drawn. The main indexes of the performance parameters and the economic benefit of the labeled patent technology are then investigated, and the corresponding performance parameter curve and economic benefit curve are drawn. Next, a suitable fitting model is selected to draw the patent feature fitting curve. Finally, the patent feature fitting curve is compared with the standard S-curve, and the slopes of the 4 curves obtained from the 4 indexes are analyzed together to determine the position of the labeled patent technology on the S-curve, i.e. its current life-cycle stage, thereby predicting the maturity of the patented technology.
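As a rough illustration of the curve-fitting step, the sketch below fits a logistic S-curve to cumulative patent counts and reads off a life-cycle stage. Everything specific in it is an assumption for the example: the logistic form $y = K / (1 + e^{-r(t - t_0)})$ as the fitting model, a known saturation level $K$, cumulative patent count as the only index, and the stage boundaries at 10%, 50% and 90% of $K$. The function names are hypothetical.

```python
import math


def fit_logistic(times, cumulative, saturation):
    """Fit y = K / (1 + exp(-r (t - t0))) for a given saturation level K.

    Uses the linearisation ln(K / y - 1) = -r t + r t0 and ordinary least
    squares, so every count must satisfy 0 < y < K.
    Returns (growth rate r, midpoint t0).
    """
    zs = [math.log(saturation / y - 1.0) for y in cumulative]
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(zs) / n
    slope = (sum((t - t_mean) * (z - z_mean) for t, z in zip(times, zs))
             / sum((t - t_mean) ** 2 for t in times))
    intercept = z_mean - slope * t_mean
    rate = -slope
    midpoint = intercept / rate
    return rate, midpoint


def life_cycle_stage(count, saturation):
    """Map a position on the S-curve to a stage (thresholds are assumptions)."""
    fraction = count / saturation
    if fraction < 0.1:
        return "infancy"
    if fraction < 0.5:
        return "growth"
    if fraction < 0.9:
        return "maturity"
    return "decline"
```

In practice the saturation level is itself unknown and would have to be estimated, and the other three indexes (patent level, performance parameters, economic benefit) would be fitted and compared in the same way before a stage is assigned.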
The technology evolution direction is predicted with a technology evolution radar map, which visually shows the gap between the patented technology and its evolution limit, so that the places where the technology needs improvement and innovation can be seen clearly. The radar map is shown in Fig. 2: the center of the polygon is the lowest level of technology evolution, the periphery of the polygon is the limit of technology evolution, each spoke represents an evolution route, and the scale marks on a spoke represent the levels of that route. Connecting the anchor points of the existing patent technology system on each route gives a shaded region that represents the current state of the system; the blank part of the polygon not covered by the shading represents its development potential. A technology system can also be divided into several subsystems and a radar map drawn for each, to judge which subsystems perform well and which are weak links. The specific steps of predicting the technology evolution direction with the radar map are: first, the technology system formed by the patent documents of a given label is analyzed, and several technology routes related to it, i.e. its possible evolution directions, are designed; then the technology system is positioned on each route, i.e. its current evolution state in each direction is determined, and the technology evolution radar map is drawn; then the radar map is analyzed: if a technical innovation point is found, technical innovation is carried out on the system; otherwise, the radar map is subdivided into a radar tree map and the analysis steps are repeated. The technology evolution analysis process is shown in Fig. 3.
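The numbers behind such a radar map can be sketched as follows. This is only an illustrative reading of the description: each evolution route is assumed to be given as a pair (current level, limit level), the shaded part is summarised as the ratio of the two, and the blank part as its complement. The function name and the route names in the usage note are hypothetical; drawing the polygon itself is left to a plotting library.

```python
def evolution_potential(levels):
    """Summarise a technology evolution radar map.

    levels maps each evolution route to (current level, limit level).
    Returns (coverage, potential, weakest_route), where coverage is the
    fraction of each spoke already reached, potential is the remaining
    fraction, and weakest_route is the route with the lowest coverage.
    """
    coverage = {route: current / limit
                for route, (current, limit) in levels.items()}
    potential = {route: 1.0 - value for route, value in coverage.items()}
    weakest_route = min(coverage, key=coverage.get)
    return coverage, potential, weakest_route
```

For example, `evolution_potential({"dynamization": (2, 5), "controllability": (4, 5)})` reports the first route as the weakest link, with 60% of its spoke still uncovered.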
Preferably, the analysis prediction result in step S7 is sent to the user in the warning result receiving manner set in step S0.
After the analysis method has been executed for each user, the invention outputs the analysis and prediction results in turn, comprising the patent early warning analysis result, the technology maturity prediction result and the technology evolution direction prediction result. For the patent early warning analysis, all patents whose similarity to the patent to be early-warned exceeds the early-warning threshold are output and sent to the user in real time through the receiving mode selected by the user, so that the user avoids falling into a patent trap.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (10)

1. A technology forecasting and patent early warning analysis method based on an improved XGboost algorithm is characterized by comprising the following steps:
step S1, constructing a patent theme database according to the patent retrieval formula;
step S2, performing feature extraction on the patent texts of the patent subject database with a convolutional neural network to obtain feature vectors $V_k$;
step S3, constructing a test set S according to the feature vectors $V_k$;
step S4, inputting the training set into an XGboost model for training, optimizing the number of base classifiers, the learning rate, the maximum depth of the tree and the minimum leaf node sample weight of the XGboost model with a gray wolf optimization algorithm, and inputting the test set into the improved XGboost model for testing to obtain the XGboost classifier;
step S5, after extracting the characteristics of the patent to be early-warned, inputting the patent to be early-warned into an XGboost classifier to obtain the classification of the patent to be early-warned and other patent texts in the same class as the patent to be early-warned;
step S6, carrying out patent early warning analysis and prediction of the technology maturity and the technology evolution direction according to the classification of the patent to be early-warned and the other patent texts in the same class, to obtain an analysis and prediction result;
and step S7, visually displaying the analysis and prediction result and sending the analysis and prediction result to a user.
2. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein before step S1 the method further comprises:
step S0, setting a patent early-warning threshold and an early-warning result receiving mode.
3. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein the specific steps of step S1 are: extracting and analyzing the corresponding intellectual property database and the knowledge of the relevant industrial field according to the patent retrieval formula, constructing the patent theme database, and at the same time performing text denoising preprocessing.
4. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein the specific steps of step S2 are: the convolutional neural network comprises an input layer, a convolutional layer, a pooling layer and an output layer; a patent text $F_k$ in the patent subject database forms a patent text representation matrix $M_k$ after passing through the input layer, and the matrix $M_k$ is expressed as a feature vector $V_k$ through the operations of the convolutional layer and the pooling layer.
5. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein the test set of step S3 is $S = \{s_k = (V_k, L_k)\}$, where S is a test set consisting of the feature vectors and labels of all documents in the patent subject database, and $L_k$ is the patent label classification number corresponding to sample $s_k$.
6. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein step S4 comprises:
step S41, setting population scale N and maximum iteration times T in parameters of a gray wolf optimization algorithm, setting the number of base classifiers, the learning rate, the maximum depth of a tree and the value range of the minimum leaf node sample weight in the parameters of the XGboost model, and initializing other parameters of the XGboost model;
step S42, randomly generating a gray wolf pack, wherein the position of each individual gray wolf consists of the number of base classifiers, the learning rate, the maximum depth of the tree and the minimum leaf node sample weight;
step S43, the XGboost model learns the training set according to the initial number of base classifiers, learning rate, maximum depth of the tree and minimum leaf node sample weight, and the fitness function value of each gray wolf is calculated according to the fitness function $F_{new}$;
step S44, dividing the gray wolf pack into 4 ranks of gray wolves, α, β, δ and ω, according to the fitness function values;
step S45, updating the position of each individual in the gray wolf pack, recalculating the fitness function value of each individual at its new position, and comparing it with the optimal fitness function value $F_g$ of the previous iteration: if $F_{new} > F_g$, the fitness function value of the individual is $F_{new}$ and the position of the individual is retained; otherwise, the fitness function value of the individual is $F_g$;
step S46, repeating steps S42 to S45, stopping the iteration when the number of iterations exceeds T, and outputting the optimal values of the number of base classifiers, the learning rate, the maximum depth of the tree and the minimum leaf node sample weight of the XGboost model;
and step S47, inputting the test set S into an XGboost model for testing, and obtaining a trained XGboost classifier.
7. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 6, wherein the fitness function $F_{new}$ of step S43 is expressed as
$$F_{new} = (F_{Precision} + F_{Recall} + F_1)/3;$$
wherein $F_{Precision}$ is the precision, expressed as:
$$F_{Precision} = \frac{TP}{TP + FP}$$
$F_{Recall}$ is the recall, expressed as:
$$F_{Recall} = \frac{TP}{TP + FN}$$
$F_1$ is an index measuring classification accuracy, expressed as:
$$F_1 = \frac{2 \cdot F_{Precision} \cdot F_{Recall}}{F_{Precision} + F_{Recall}}$$
wherein TP, FP and FN are the true positives, false positives and false negatives, respectively, obtained by dividing the patent texts according to their true categories and predicted categories.
8. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 2, wherein the patent early warning analysis in step S6 comprises: calculating, with the SimHash algorithm, the similarity between the patent to be early-warned and the other patents belonging to the same category, and outputting the patents whose similarity exceeds the patent early-warning threshold.
9. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 1, wherein the technology maturity and the technology evolution direction in step S6 are predicted as follows: a patent feature fitting curve is drawn according to TRIZ theory and compared with the standard S-curve, and the maturity of the patented technology is predicted in combination with a patent data prediction algorithm; the technology evolution process is analyzed in detail with a technology evolution radar map, and the visualized result is displayed to predict the technology evolution direction.
10. The technology forecasting and patent early warning analysis method based on the improved XGboost algorithm according to claim 2, wherein the analysis and prediction result in step S7 is sent to the user in the early-warning result receiving mode set in step S0.
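The gray wolf search over the four XGboost hyperparameters named in the claims, together with the averaged precision, recall and F1 fitness, can be sketched as below. This is a hedged illustration rather than the claimed procedure: it uses the commonly published gray wolf update (each wolf moves toward the average of the positions suggested by the α, β and δ wolves), keeps a new position only when it improves that wolf's own fitness, and replaces the expensive train-and-evaluate step with a hypothetical `toy_fitness` surrogate. The function names, the bounds and the surrogate are all assumptions for the example.

```python
import random


def fitness_from_counts(tp, fp, fn):
    """Average of precision, recall and F1 computed from TP, FP and FN."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return (precision + recall + f1) / 3


def grey_wolf_optimize(fitness, bounds, n_wolves=10, max_iter=30, seed=0):
    """Maximise fitness over box bounds; needs n_wolves >= 3."""
    rng = random.Random(seed)
    # Random initial pack: one position per wolf.
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    scores = [fitness(w) for w in wolves]
    for t in range(max_iter):
        # Rank the pack; the three best wolves act as alpha, beta and delta.
        order = sorted(range(n_wolves), key=lambda i: scores[i], reverse=True)
        leaders = [list(wolves[i]) for i in order[:3]]
        a = 2.0 * (1 - t / max_iter)       # decreases linearly from 2 to 0
        for i in range(n_wolves):
            candidate = []
            for d, (lo, hi) in enumerate(bounds):
                pulls = []
                for leader in leaders:
                    coef_a = a * (2 * rng.random() - 1)
                    coef_c = 2 * rng.random()
                    gap = abs(coef_c * leader[d] - wolves[i][d])
                    pulls.append(leader[d] - coef_a * gap)
                # Average the three suggestions and clip to the allowed range.
                candidate.append(min(max(sum(pulls) / 3, lo), hi))
            new_score = fitness(candidate)
            if new_score > scores[i]:      # keep the move only if it improves
                wolves[i], scores[i] = candidate, new_score
    best = max(range(n_wolves), key=lambda i: scores[i])
    return wolves[best], scores[best]


# Hypothetical search ranges: number of base classifiers, learning rate,
# maximum tree depth, minimum leaf node sample weight.
BOUNDS = [(50, 500), (0.01, 0.3), (3, 10), (1, 10)]


def toy_fitness(position):
    """Stand-in for 'train the model, then score it'; peaks at (200, 0.1, 6, 3)."""
    n_trees, rate, depth, min_weight = position
    return -(((n_trees - 200) / 400) ** 2 + (rate - 0.1) ** 2
             + ((depth - 6) / 10) ** 2 + ((min_weight - 3) / 10) ** 2)
```

In a real run, `toy_fitness` would be replaced by a function that rounds the integer-valued parameters, trains the model on the training set with them, counts TP, FP and FN on held-out data, and returns `fitness_from_counts(tp, fp, fn)`.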
CN202010557407.XA 2020-06-18 2020-06-18 Technology forecasting and patent early warning analysis method based on improved XGboost algorithm Active CN111708865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010557407.XA CN111708865B (en) 2020-06-18 2020-06-18 Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010557407.XA CN111708865B (en) 2020-06-18 2020-06-18 Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Publications (2)

Publication Number Publication Date
CN111708865A CN111708865A (en) 2020-09-25
CN111708865B true CN111708865B (en) 2021-07-09

Family

ID=72540975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010557407.XA Active CN111708865B (en) 2020-06-18 2020-06-18 Technology forecasting and patent early warning analysis method based on improved XGboost algorithm

Country Status (1)

Country Link
CN (1) CN111708865B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801140A (en) * 2021-01-07 2021-05-14 长沙理工大学 XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
CN114615010B (en) * 2022-01-19 2023-12-15 上海电力大学 Edge server-side intrusion prevention system design method based on deep learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563394A (en) * 2017-09-19 2018-01-09 广东工业大学 A kind of method and system of predicted pictures popularity
CN107908688A (en) * 2017-10-31 2018-04-13 温州大学 A kind of data classification Forecasting Methodology and system based on improvement grey wolf optimization algorithm
CN109190828A (en) * 2018-09-07 2019-01-11 苏州大学 Gas leakage concentration distribution determines method, apparatus, equipment and readable storage medium storing program for executing
CN110110848A (en) * 2019-05-05 2019-08-09 武汉烽火众智数字技术有限责任公司 A kind of combination forecasting construction method and device
CN110289097A (en) * 2019-07-02 2019-09-27 重庆大学 A kind of Pattern Recognition Diagnosis system stacking model based on Xgboost neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on Load Prediction Based on Improve GWO and ELM in Cloud Computing;Shengcai Zhang.ET-AL;《2019 IEEE 5th International Conference on Computer and Communications (ICCC)》;20200413;第102-105页 *
Taxi Trip Travel Time Prediction with Isolated XGBoost Regression;Kusal D. Kankanamge.ET-AL;《2019 Moratuwa Engineering Research Conference (MERCon)》;20190705;第54-59页 *

Also Published As

Publication number Publication date
CN111708865A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN108399158B (en) Attribute emotion classification method based on dependency tree and attention mechanism
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN109241377A (en) A kind of text document representation method and device based on the enhancing of deep learning topic information
CN111597340A (en) Text classification method and device and readable storage medium
CN112687374B (en) Psychological crisis early warning method based on text and image information joint calculation
CN112732921B (en) False user comment detection method and system
CN115688024B (en) Network abnormal user prediction method based on user content characteristics and behavior characteristics
CN111708865B (en) Technology forecasting and patent early warning analysis method based on improved XGboost algorithm
Pandey et al. Fake news detection from online media using machine learning classifiers
CN112528668A (en) Deep emotion semantic recognition method, system, medium, computer equipment and terminal
CN112416358B (en) Intelligent contract code defect detection method based on structured word embedded network
CN115952292B (en) Multi-label classification method, apparatus and computer readable medium
CN112148868A (en) Law recommendation method based on law co-occurrence
CN113240201A (en) Method for predicting ship host power based on GMM-DNN hybrid model
CN114942974A (en) E-commerce platform commodity user evaluation emotional tendency classification method
CN109376235A (en) The feature selection approach to be reordered based on document level word frequency
CN112489689B (en) Cross-database voice emotion recognition method and device based on multi-scale difference countermeasure
CN114519508A (en) Credit risk assessment method based on time sequence deep learning and legal document information
Saha et al. The Corporeality of Infotainment on Fans Feedback Towards Sports Comment Employing Convolutional Long-Short Term Neural Network
CN116629258A (en) Structured analysis method and system for judicial document based on complex information item data
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN114358813B (en) Improved advertisement putting method and system based on field matrix factorization machine
CN117235253A (en) Truck user implicit demand mining method based on natural language processing technology
Selvi et al. Topic categorization of Tamil news articles
CN113821571A (en) Food safety relation extraction method based on BERT and improved PCNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant