CN115794999B - Patent document query method based on diffusion model and computer equipment - Google Patents

Patent document query method based on diffusion model and computer equipment

Info

Publication number
CN115794999B
CN115794999B
Authority
CN
China
Prior art keywords
diffusion
model
diffusion model
retrieval
patent documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310048755.8A
Other languages
Chinese (zh)
Other versions
CN115794999A (en)
Inventor
尤元岳
徐青伟
严长春
裴非
范娥媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinghe Zhiyuan Technology Co ltd
Zhiguagua Tianjin Big Data Technology Co ltd
Original Assignee
Zhiguagua Tianjin Big Data Technology Co ltd
Beijing Zhiguquan Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhiguagua Tianjin Big Data Technology Co ltd, Beijing Zhiguquan Technology Service Co ltd filed Critical Zhiguagua Tianjin Big Data Technology Co ltd
Priority to CN202310048755.8A priority Critical patent/CN115794999B/en
Publication of CN115794999A publication Critical patent/CN115794999A/en
Application granted granted Critical
Publication of CN115794999B publication Critical patent/CN115794999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a patent document query method based on a diffusion model and computer equipment, aiming to solve the problem that the completeness and accuracy of existing patent retrieval are not ideal. In the method, a short text input by a user is segmented into a plurality of keywords, which are respectively sent into three diffusion models for diffusion generation, where the clusters of the keywords in the word segmentation result jointly serve as control signals of the diffusion models to constrain the direction of diffusion generation. The training corpora of the three diffusion models are derived from the abstract, the claims and the specification respectively, so that each model generates sentences similar in expression form to sentences of the abstract, the claims and the specification. After retrieval, the three groups of patent documents are weighted and integrated, and the several patent documents with the highest weighted similarity are selected and output as the user's intended retrieval result. The retrieval result is therefore comprehensive and better matches the user's real retrieval intention, improving the completeness and accuracy of patent retrieval.

Description

Patent document query method based on diffusion model and computer equipment
Technical Field
The application belongs to the technical field of document retrieval, and particularly relates to a patent document query method and computer equipment.
Background
Patent retrieval is used for patent duplication checking and infringement detection and is a key core link in the process of patent application and rights maintenance; how to realize accurate and efficient retrieval has become an important part of patent system construction.
At present, common patent retrieval methods are generally realized by matching and ranking between the retrieval keyword phrase input by the user and the patent text. Especially in scenarios such as "simple retrieval" and "semantic retrieval", the retrieval keywords input by the user may be associated with multiple topics, so the short text input by the user cannot completely express the user's real retrieval intention; the limited information in the short text does not match the rich semantic content of patent documents, and the completeness and accuracy of the final patent retrieval are therefore not ideal.
Meanwhile, traditional query expansion is realized with general-domain synonym lists, word vectors and the like; however, general-domain synonyms cannot effectively capture the semantic similarity between professional terms in the patent field. These methods cannot adapt to diverse and efficient retrieval in a dynamic, previously unseen (zero-shot) patent retrieval scenario, and the retrieval text automatically generated by such query expansion does little to improve the overall accuracy of retrieval.
Disclosure of Invention
The application provides a patent document query method based on a diffusion model and computer equipment, and aims to solve the problem that the completeness and accuracy of the current patent retrieval are not ideal.
Therefore, the following technical scheme is provided in the application:
a patent document query method based on a diffusion model comprises the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval exceeds a preset length threshold, segmenting the text content and then sending the segmentation result into each of three diffusion models for diffusion generation, wherein the clusters of all keywords in the segmentation result are jointly used as control signals of the diffusion models to constrain the generation direction of the diffusion models; the three diffusion models are respectively denoted as a first diffusion model, a second diffusion model and a third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so that they correspondingly generate sentences similar in expression form to sentences of the abstract, the claims and the specification;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
and performing weighted integration on the three groups of patent documents, and selecting a plurality of weighted patent documents with the highest similarity as the intention retrieval result of the user and outputting the intention retrieval result.
Optionally, the patent document query method further includes:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
Optionally, the three groups of patent documents are of the same size; of course, their sizes may also differ.
Preferably, the training method of the three diffusion models comprises the following steps:
gradually adding noise into the training corpus to continuously destroy the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, this process being denoted as the noising process; and then denoising the completely random Gaussian noise, using the corrupted corpus information stored during the noising process as label data, and continuously denoising with a generative model until the original corpus information is finally recovered, so that the generative model learns, through the denoising process, the capability of generating the corresponding corpus.
Optionally, the generating model adopts a Transformer model or a GPT model.
Preferably, the method for generating the corpus includes:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and performing word segmentation on the first sentence, the second sentence and the third sentence by adopting a text word segmentation device respectively, wherein the corresponding word segmentation result is a training corpus used for the first diffusion model, the second diffusion model and the third diffusion model.
Preferably, in the three diffusion models, each diffusion model performs a diffusion generation process, which specifically includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
Preferably, the retrieval system calculates the similarity between the sentence generated by the first diffusion model and the abstract text vector of each patent document, between the sentence generated by the second diffusion model and the claims text vector of each patent document, and between the sentence generated by the third diffusion model and the specification text vector of each patent document, using a BM25 model or BERT word-vector representations, and returns the N patent documents with the highest similarity for each.
Computer equipment comprising a memory and a processor, wherein the memory stores a computer program, and is characterized in that the processor implements the steps of the patent document inquiry method based on the diffusion model when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The application has at least the following beneficial effects:
the method comprises the steps of obtaining a plurality of keywords through word segmentation for a short text input by a user, and respectively sending the keywords into three diffusion models for diffusion generation, wherein clusters of the keywords in word segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the training corpora of the three diffusion models are respectively derived from the abstract, the claim and the specification and are used for correspondingly generating sentences similar to the sentence expression forms of the abstract, the claim and the specification; sending the documents into a retrieval system for retrieval to obtain three groups of patent documents, performing weighted integration on the three groups of patent documents, and selecting and outputting a plurality of patent documents with the highest similarity after weighting as the intention retrieval result of the user; therefore, the retrieval result is more comprehensive and more accords with the real retrieval intention of the user, and the completeness and the accuracy of patent retrieval are improved.
Drawings
FIG. 1 is a schematic diagram illustrating a basic principle of a patent document query method based on a diffusion model according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a training process of a diffusion model in an embodiment of the present application (taking a summary diffusion model as an example);
FIG. 3 is a schematic diagram of a training method for three diffusion models according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a sentence generation process of a diffusion model in an embodiment of the present application (taking a summary diffusion model as an example);
FIG. 5 is a diagram illustrating a sentence generation method for three diffusion models according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an extended search and integration process according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, consider an application scenario in which a user enters a retrieval sentence on a website or APP that provides patent retrieval services, either through "simple retrieval" (typically a search box presented on the home page) or by selecting "semantic retrieval" (which supports longer text content).
As shown in fig. 1, there is provided a patent document query method based on a diffusion model, comprising the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval does not exceed the preset length threshold, sending it directly into a retrieval system (optionally after appropriate preprocessing), and retrieving patent documents with the abstract, the claims and the specification respectively as retrieval ranges to obtain three groups of patent documents;
if the text content input by the user for retrieval exceeds the preset length threshold, segmenting the text content and then sending the segmentation result into each of three diffusion models (an abstract diffusion model, a claims diffusion model and a specification diffusion model) for diffusion generation, wherein the clusters of all keywords in the segmentation result are jointly used as control signals of the diffusion models to constrain their generation direction; the three diffusion models may be abbreviated as a first diffusion model, a second diffusion model and a third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so that they correspondingly generate sentences similar in expression form to sentences of the abstract, the claims and the specification; the sentences generated by the three diffusion models are then sent into the retrieval system, and patent documents are retrieved with the abstract, the claims and the specification respectively as retrieval ranges;
note that the "claim" and the "claims" mentioned here are different concepts: the former emphasizes individual claim sentences (each claim expresses an independent meaning and can generate its own text vector; for the same semantics, the expression form of a claim sentence may differ from that of sentences in the abstract and the specification), while the latter is one of the basic components of a patent document (the target range for similarity calculation);
in addition, similarity of sentence expression form is a different concept from the similarity calculation in patent document retrieval: the former focuses on the expression form of sentences, aiming to make the sentences generated by the three models look more like abstract sentences, claim sentences and specification sentences respectively, while the latter focuses on semantic closeness;
performing weighted integration on the three groups of patent documents, and selecting and outputting the K patent documents with the highest weighted similarity as the user's intended retrieval result, where K is a preset number of documents and is less than 3N.
Specifically, retrieving patent documents with the abstract, the claims and the specification respectively as retrieval ranges means calculating the similarity between the sentence generated by the first diffusion model and the abstract text vectors of the patent documents in the patent library, between the sentence generated by the second diffusion model and the claims text vectors, and between the sentence generated by the third diffusion model and the specification text vectors, and returning the N patent documents with the highest similarity for each, thereby obtaining three groups of patent documents (3N patent documents in total). Of course, the sizes of the three groups may also differ, for example: the first group may be set to 100 documents, the second group to 80 and the third group to 50, giving 230 documents in total; the three groups are then weighted and integrated, and the 150 patents with the highest weighted similarity are selected.
Here, the technology of retrieving patent documents with a specific field (abstract, claims or specification) as the retrieval range (approximate retrieval) is itself prior art in this field.
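The overall query flow described above can be illustrated with the following minimal sketch. All names here (segment_fn, the model and backend objects, weighted_top_k) are hypothetical placeholders standing in for the components described in the text, not an implementation disclosed by the patent; a concrete version of the weighted merge appears later in the extended retrieval and integration part.

```python
# Minimal sketch of the overall query flow; every helper name is a hypothetical
# placeholder for a component described in the text, not a real API.

def query_patents(user_text, length_threshold, segment_fn, models, backends,
                  weights, n_per_group, k):
    """models / backends / weights: dicts keyed by 'abstract', 'claims', 'description'."""
    groups = {}
    if len(user_text) <= length_threshold:
        # Short input: send the user text directly to each section-level index.
        for section, backend in backends.items():
            groups[section] = backend.search(user_text, top_n=n_per_group)
    else:
        # Long input: segment into keywords and expand with the three diffusion models;
        # the keyword clusters act as the control signal of each model.
        keywords = segment_fn(user_text)
        for section, model in models.items():
            expanded_sentence = model.generate(keywords)
            groups[section] = backends[section].search(expanded_sentence, top_n=n_per_group)

    # Weighted integration of the three result groups; keep the K highest-scoring documents.
    return weighted_top_k(groups, weights, k)
```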
The diffusion model is a deep generative model. Automatically generating query expansions conditioned on the input has the advantages of strong robustness, high sampling efficiency, semantic closeness and sample diversity, and the generated results have a certain interpretability. Incrementally generating query content through the diffusion model allows the retrieval results to cover the topic more completely and improves recall; since the cost of missing a related patent is high, the recall metric is very important. In addition, new words and terms appear in patents at a high rate, and the diffusion model can generate interpretable expanded queries with wide and diverse coverage, which helps the user's intuitive understanding. The diffusion model plays an important role in the whole process; in this embodiment, a diffusion model that can be controlled by keywords is trained, and this controllable diffusion model is applied to an expanded retrieval system in the patent field.
The purpose of the first part, training, is to give the diffusion model the capability of generating sentences in arbitrary text directions through the training data and training method. For example, after training on data from various fields, the diffusion model can generate sentences in many field directions, including artificial intelligence, computing, transportation and so on; however, at this stage there is no way to control the field direction of the specific sentence being generated: the model may produce a sentence in the artificial intelligence field or one in the computing field, entirely at random. The first step only gives the model the capability of generating sentences across all fields. In this step, in order to make the results of the expanded patent retrieval more accurate, three models are trained respectively with corpora from the abstract, the claims and the specification, so that the sentences generated by the three models are more like sentences from the abstract, the claims and the specification.
The method for training the diffusion model in the embodiment mainly comprises the following steps:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, wherein the sentences can be marked as a first sentence, a second sentence and a third sentence; respectively segmenting words of the first sentence, the second sentence and the third sentence by adopting a text word segmentation device, wherein the corresponding word segmentation result is training corpus used for the first diffusion model, the second diffusion model and the third diffusion model;
gradually adding noise into the corresponding training corpus of each diffusion model to continuously destroy the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, this process being denoted as the noising process; and then denoising the completely random Gaussian noise, using the corrupted corpus information stored during the noising process as label data, and continuously denoising with a generative model until the original corpus information is finally recovered, so that the model learns, through the denoising process, the capability of generating the corresponding corpus.
The second part, generation, constrains the generation direction of the diffusion model through the keywords input by the user, so that the diffusion model generates sentences in a fixed text direction. For example, if the user wants an expanded retrieval of sentences in the artificial intelligence field and therefore inputs the keyword "artificial intelligence", then in the generation process the model gradually shifts the direction of the sentence it is generating toward the artificial intelligence field according to the keyword input by the user, and finally generates a sentence in that field. Such sentences can serve as retrieval conditions with richer content, expanding the query beyond the keyword "artificial intelligence" input by the user. Based on the models trained in the first step on the abstract, the claims and the specification, the keyword "artificial intelligence" input by the user is expanded, and the three diffusion models generate abstract, claims and specification sentences in the artificial intelligence direction.
The specific process of the diffusion generation by the diffusion model in this embodiment mainly includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords; searching for the domain word list containing each keyword (each domain word list is generated in advance based on a clustering algorithm); taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
The third part, retrieval, uses the abstract, claims and specification sentences in the artificial intelligence direction obtained from the generation part to retrieve against the abstract, claims and specification of the patent documents in the retrieval system respectively. That is, the abstract sentences generated by the diffusion model in the artificial intelligence direction are compared against the abstract information of the patent documents in the retrieval system, and the retrieval system returns the top N patents based on the similarity between the abstracts of the patent documents and the sentences generated by the diffusion model. Similarly, the claim sentences and the specification sentences are used to retrieve against the claims and the specification of the patent documents, each also returning the top N patents by similarity. Weighted statistics are then performed on these 3N patents to find the top K patents most relevant to the "artificial intelligence" keyword input by the user, and the retrieval system returns these K patents as the result.
Thus, the whole process can be summarized in the following three steps:
1. The diffusion model training process, in which the diffusion model acquires the capability of generating sentences in arbitrary directions; the diffusion models are trained separately on the corpora of the different patent parts, so that each diffusion model can generate sentences of the corresponding patent part. This process is not part of the expanded retrieval flow itself but is a precondition for the expanded retrieval function.
2. The sentence generation process of the diffusion model, which is part of the expanded retrieval flow: the diffusion model gradually generates sentences in the direction of the keywords input by the user, and the diffusion models trained for the different patent parts respectively generate abstract, claims and specification sentences in the keyword direction.
3. The expanded retrieval and integration process, in which the generated sentences of the three parts in the same field are passed to the retrieval system to retrieve against the three parts of the patents respectively, and weighted statistics over the retrieval results identify the most similar top K patents, completing the expanded retrieval.
These three steps are described in further detail below.
1. The diffusion model training process aims to give the diffusion model the capability of generating sentences in arbitrary domain directions. The diffusion model training step comprises: corpus construction, abstract diffusion model training, claims diffusion model training and specification diffusion model training.
Step 1, corpus construction: because the expanded retrieval targets the patent field and the final goal is for the models to generate sentences related to the abstract, the claims and the specification, the abstract, claims and specification text of each collected patent document is extracted and split by periods, semicolons and the like, and the split sentences are used as the preliminary training corpora of the three diffusion models. Three different diffusion models are prepared and trained with these three parts of the training corpora.
To avoid repeated explanation, the following process takes the abstract sentence corpus as an example and trains the diffusion model for the abstract; the diffusion models for the claims and the specification must likewise be trained as a precondition for the subsequent generation process.
Step 2, abstract diffusion model training: as shown in fig. 2, the overall idea of the training process is to gradually add noise to the collected abstract corpus, continuously destroying the information of the whole corpus and storing the information of each destruction step until the corpus information is destroyed into completely random Gaussian noise. This process is called forward propagation, i.e. the noising process. After the noising process, random Gaussian noise is obtained; this Gaussian noise then needs to be denoised continuously, using the corrupted corpus information stored during the noising process as label data, with a generative model such as a Transformer or GPT model, until the original corpus information is finally recovered, so that the generative model can learn the capability of generating the corresponding corpus through the denoising process. The specific training process, as shown in fig. 2, is as follows:
(1) The abstract sentences acquired by the corpus construction part are used as input data of the diffusion model; in this example the abstract sentence "an artificial intelligent automobile, which comprises an automatic route searching method and a danger predicting module" is used as the input text. The input text is then segmented; the text word segmenter can be pre-trained or an existing trained segmenter (for example the jieba segmenter) can be used directly. The obtained word segmentation result is w, where w is the word list of the segmented input sentence. Assuming the input sentence contains n words after segmentation, the input data is divided into the words

$w = (w_1, w_2, \ldots, w_n)$

The word segmentation result in this example is the corresponding word sequence of the above sentence, for example (artificial intelligent, automobile, automatic route searching, method, danger predicting, module).
(2) The word segmentation result w is passed into a word vector embedding layer EMB, which maps the discrete words into a continuous space; the resulting word embedding is

$\mathrm{EMB}(w) = \big(\mathrm{EMB}(w_1), \mathrm{EMB}(w_2), \ldots, \mathrm{EMB}(w_n)\big) \in \mathbb{R}^{n \times d}$

i.e. the n words are mapped into n d-dimensional vectors.
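The mapping of a segmented sentence to n d-dimensional vectors in steps (1)-(2) can be illustrated with the following toy sketch; the whitespace split and the randomly initialized embedding table are stand-ins for the trained Chinese segmenter and the learned embedding layer EMB of the embodiment.

```python
# Toy sketch of steps (1)-(2): segment a sentence into words w = (w_1, ..., w_n)
# and map them to n d-dimensional vectors with an embedding layer EMB.
import torch
import torch.nn as nn

sentence = "an artificial intelligent automobile with an automatic route searching method"
words = sentence.split()                      # stand-in for the text word segmenter
vocab = {w: i for i, w in enumerate(sorted(set(words)))}

d = 16                                        # embedding dimension
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d)

ids = torch.tensor([vocab[w] for w in words])
x = emb(ids)                                  # shape (n, d): n words -> n d-dim vectors
print(x.shape)
```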
(3) The embedded word vectors are then transformed through a Markov chain into the hidden variables of the diffusion model: the probabilistic model

$q_\phi(x_0 \mid w) = \mathcal{N}\big(x_0;\ \mathrm{EMB}(w),\ \sigma_0^2 I\big)$

generates the corresponding hidden variable $x_0$, where $q_\phi(x_0 \mid w)$ is the probability that, given $w$, the hidden variable $x_0$ is generated via word-vector encoding and the Markov chain; it is a normal distribution with mean $\mathrm{EMB}(w)$ and variance $\sigma_0^2 I$, and the value of $x_0$ is sampled from this normal distribution. In the reverse process, a trainable rounding (approximation) step is added that maps $x_0$ back to the original segmented text, with the mapping relation

$p_\theta(w \mid x_0) = \prod_{i=1}^{n} p(w_i \mid x_i)$

where $p(w_i \mid x_i)$ is a softmax distribution whose meaning is the probability, given $x_i$, that the softmax distribution yields the word $w_i$. In the following, for convenience of understanding, $q$ is used as the probabilistic representation of feed-forward propagation (the noising process) and $p_\theta$ as the probabilistic representation of the reverse denoising process.
(4) During feed-forward propagation, the intermediate hidden variables $x_1, x_2, \ldots, x_T$ are constructed: the feed-forward propagation progressively adds Gaussian noise to $x_{t-1}$ until step $T$ is reached, at which point $x_T$ is close to pure Gaussian noise. Each transition from $x_{t-1}$ to $x_t$ is sampled from the normal distribution

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$

where $\beta_t$, the amount of Gaussian noise added at step $t$, is a hyperparameter. The feed-forward process $q$ contains no trainable parameters and defines the training objective: noisy data are generated according to the predefined feed-forward process $q$, and a model is trained to reverse the process and reconstruct the data.
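A minimal sketch of this noising process follows, under the assumption of a simple linear $\beta_t$ schedule (the actual schedule is not specified in the text):

```python
# Sketch of the feed-forward (noising) process of step (4): starting from the
# embedded corpus x_0, Gaussian noise is added for T steps according to a
# predefined schedule beta_t, and the intermediate x_t are stored as labels
# for the denoising model. The linear beta schedule is an illustrative assumption.
import torch

def forward_noising(x0: torch.Tensor, T: int = 200):
    betas = torch.linspace(1e-4, 0.02, T)           # beta_t: noise added at step t (hyperparameter)
    xs = [x0]
    x = x0
    for t in range(T):
        mean = torch.sqrt(1.0 - betas[t]) * x        # q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I)
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        xs.append(x)                                 # stored corrupted corpus used as label data
    return xs                                        # xs[T] is close to pure Gaussian noise

x0 = torch.randn(8, 16)                              # toy embedded sentence (n=8 words, d=16)
trajectory = forward_noising(x0)
print(len(trajectory), trajectory[-1].std())
```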
(5) In the reverse process, the diffusion model gradually denoises the Gaussian noise through the reverse transition $p_\theta(x_{t-1} \mid x_t)$ and progressively reconstructs $x_0$. The whole process is that, starting from the Gaussian noise $x_T$, the model denoises step by step, generating a series of hidden variables $x_{T-1}, \ldots, x_1$ and finally a sample $x_0$ close to the target distribution. The initial state is $p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$, and each denoising step from $x_t$ to $x_{t-1}$ is obtained from

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$

where $\mu_\theta$ and $\Sigma_\theta$ can be computed and learned by a generative model such as a Transformer or GPT model. The data for this training process are taken from the forward diffusion process: the input is $x_t$ and the target output is $x_{t-1}$, and the model is used to learn the mean and variance of the feed-forward distribution. Taking one denoising step as an example, the model receives $x_t$ as input and outputs the $x_{t-1}$ predicted by the denoising process; this prediction should approach the $x_{t-1}$ stored during feed-forward propagation, so the discrepancy between the feed-forward $x_{t-1}$ and the denoised $x_{t-1}$ is taken as the loss, and the parameters of $\mu_\theta$ and $\Sigma_\theta$ are updated by back propagation, so that the model learns the mean and variance of the current distribution.
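The following toy sketch illustrates one such reverse step; the small feed-forward network stands in for the Transformer or GPT model that predicts $\mu_\theta$, and the variance is fixed to $\beta_t$ for simplicity rather than learned as $\Sigma_\theta$:

```python
# Sketch of one reverse (denoising) step of step (5).
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts mu_theta(x_t, t); stands in for the Transformer/GPT of the text."""
    def __init__(self, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        t_feat = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def reverse_step(model: Denoiser, x_t: torch.Tensor, t: int, beta_t: float) -> torch.Tensor:
    # p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), Sigma_theta); sample one x_{t-1}
    mu = model(x_t, t)
    return mu + beta_t ** 0.5 * torch.randn_like(x_t)

model = Denoiser(d=16)
x_T = torch.randn(8, 16)                     # start from Gaussian noise
x_prev = reverse_step(model, x_T, t=200, beta_t=0.02)
print(x_prev.shape)
```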
(6) The diffusion model is trained by maximizing the likelihood of the data $x_0$, and the canonical training objective is the variational lower bound of $\log p_\theta(x_0)$, so the loss function of the diffusion model becomes:

$L_{\mathrm{vlb}}(x_0) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\Big[ \log \tfrac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \tfrac{q(x_{t-1} \mid x_0, x_t)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \Big]$

However, this training objective is not stable and requires a great deal of optimization skill, so a simpler alternative objective is devised: $L_{\mathrm{vlb}}$ is expanded and re-weighted into a mean-squared-error loss, and the loss function of the diffusion model becomes

$L_{\mathrm{simple}}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \big\| \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0) \big\|^2$

where $\hat{\mu}(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_0, x_t)$, which is close to Gaussian, and $\mu_\theta(x_t, t)$ is the mean of $p_\theta(x_{t-1} \mid x_t)$ predicted by the neural network.
(7) Bringing the mapping of the word vectors into the hidden variable $x_0$ in step (3), i.e. $q_\phi(x_0 \mid w)$, and the process of re-mapping the reconstructed $x_0$ back to words, i.e. $p_\theta(w \mid x_0)$, into the loss function of step (6) finally yields the end-to-end training loss function:

$L_{\mathrm{vlb}}^{\mathrm{e2e}}(w) = \mathbb{E}_{q_\phi(x_0 \mid w)} \big[ L_{\mathrm{vlb}}(x_0) + \log q_\phi(x_0 \mid w) - \log p_\theta(w \mid x_0) \big]$

It can also be optimized as:

$L_{\mathrm{simple}}^{\mathrm{e2e}}(w) = \mathbb{E}_{q_\phi(x_0 \mid w)} \big[ L_{\mathrm{simple}}(x_0) + \| \mathrm{EMB}(w) - \mu_\theta(x_1, 1) \|^2 - \log p_\theta(w \mid x_0) \big]$

The two training loss functions are equivalent in principle. The diffusion model is trained with this loss function and back propagation, which completes the training of a single diffusion model.
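Putting the noising process and the denoiser together, a simplified training step consistent with the description above can be sketched as follows (reusing forward_noising and Denoiser from the sketches above). Regressing onto the stored $x_{t-1}$ is used here as a simple stand-in for the re-weighted objective $L_{\mathrm{simple}}$, and the optimizer choice is an illustrative assumption:

```python
# Simplified training step: noise the embedded corpus, then train the denoiser so
# that mu_theta(x_t, t) matches the stored x_{t-1} from the noising trajectory.
import random
import torch

def train_step(model, optimizer, x0, T=200):
    trajectory = forward_noising(x0, T)              # from the noising sketch above
    t = random.randint(1, T)                         # sample a diffusion step
    x_t, x_tm1 = trajectory[t], trajectory[t - 1]
    pred = model(x_t, t)                             # mu_theta(x_t, t)
    loss = ((pred - x_tm1) ** 2).mean()              # || mu_theta - x_{t-1} ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = Denoiser(d=16)                               # from the denoising sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    print(train_step(model, opt, torch.randn(8, 16)))
```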
Step 3, in order to improve retrieval accuracy, the claims diffusion model and the specification diffusion model are trained following the same processes (1) to (7) of step 2, with the corpus information corresponding to sentences of the claims and the specification; three diffusion models are thus obtained, as shown in fig. 3.
2. The sentence generation process of the diffusion model. The purpose of this process is to perform, for the keywords input by the user, diffusion generation in the keyword direction with each of the three trained diffusion models, so as to generate sentences that correspond to the three parts of a patent and belong to the keyword's field direction. The sentence generation process can be divided into: diffusion model preparation, generating abstract sentences with the diffusion model, generating claim sentences with the diffusion model, and generating specification sentences with the diffusion model.
Step 1, diffusion model preparation: first, a domain vocabulary is pre-trained, or a pre-trained domain vocabulary is used; for example, the artificial intelligence domain may contain keywords such as "artificial intelligence" and "neural network". Each individual domain vocabulary is then regarded as a bag of words for the nbow model. The domain word lists can be built by segmenting, deduplicating and removing stop words from the text content of all Chinese patents and encoding each word into a vector; the vectors can be produced with an existing word-vector library or with a model such as BERT. The encoded word vectors of the words contained in all Chinese patents are then clustered, e.g. with KNN or Kmeans, to obtain the clusters (the "cluster words" in fig. 4), and each cluster word list obtained by clustering is regarded as a domain word list.
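A minimal sketch of building domain word lists by clustering word vectors follows (KMeans here; the random vectors stand in for embeddings from an existing word-vector library or BERT):

```python
# Sketch of step 1: encode the deduplicated patent vocabulary into word vectors
# and cluster them so that each cluster becomes one domain word list (nbow bag of words).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["artificial intelligence", "neural network", "image", "convolution",
         "engine", "gearbox", "brake", "battery"]
vectors = rng.normal(size=(len(vocab), 32))          # toy word vectors

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
domain_word_lists = {}
for word, label in zip(vocab, kmeans.labels_):
    domain_word_lists.setdefault(int(label), []).append(word)
print(domain_word_lists)                              # each cluster = one domain word list
```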
Step 2, generating abstract sentences with the diffusion model. The main purpose of this step is to use the phrase input by the user as control information so that the abstract diffusion model generates sentences related to the direction of that phrase. The first input is the keyword phrase input by the user, which serves as the control information that steers the diffusion model toward generating sentence text in the keyword direction; the second input is random Gaussian noise, from which the diffusion model starts and which it continuously denoises in order to generate a fluent sentence.
The overall flow of the generation stage is shown in fig. 4. First, the phrase text input by the user is taken, for example "artificial intelligence image": the user probably wants to retrieve content in the artificial intelligence direction concerning image recognition or image processing methods. After word segmentation and stop-word removal, the two keywords "artificial intelligence" and "image" are obtained; each keyword is looked up among the clusters to find the clusters that contain it. As shown in fig. 4, cluster 1 contains "artificial intelligence" and cluster 2 contains "image". Cluster 1 and cluster 2 are used as the control signals of the diffusion model, steering it to generate words from these two clusters. The control process trains a classifier whose classes are the cluster classes: the denoising result of each step of the diffusion model is predicted by the classifier, the deviation between this prediction and the clusters of the keywords input by the user forms a loss, and the hidden variable of the current step of the abstract diffusion model is modified through back propagation by gradient updates, so that the updated hidden variable is biased further toward the directions of cluster 1 and cluster 2. However, a single diffusion step with back propagation does not immediately produce a fluent sentence highly related to cluster 1 and cluster 2, so this process must be repeated for multiple steps, gradually shifting the hidden variable toward cluster 1 and cluster 2 while generating a fluent sentence. The number of steps is a hyperparameter, set to 200 in this embodiment. After 200 steps of diffusion generation and direction shifting, the generated hidden variable is mapped to text with a softmax function to obtain the generated sentence. The generated sentence is the output of this step: the abstract diffusion model has successfully generated a sentence that is related to "artificial intelligence" and "image" and resembles an abstract. The specific implementation and formula logic are as follows.
The generation stage of the diffusion model starts from Gaussian noise $x_T$, denoises it step by step to generate the hidden variable $x_0$ of a fluent sentence, and then maps $x_0$ back to a text sentence with the rounding (approximation) model of step (3) of the diffusion model training process described previously; this is the generation process of a general diffusion model. However, it can also be seen that the whole process starts from Gaussian noise and produces an uncontrolled, randomly generated sentence, so a general diffusion model has no way to control the direction of the generated sentence. This embodiment therefore controls the diffusion model to generate sentences in the keyword direction, i.e. it controls the generation direction of the hidden variables in the diffusion model, that is, the value of the hidden variable $x_{t-1}$. In this embodiment the control is realized through the domain vocabulary, namely the cluster vocabulary generated in step 1, so the control process can be represented by the probability

$p(x_{t-1} \mid x_t, c)$

where $c$ denotes the control condition, i.e. the keywords, and the probability represents generating the hidden variable $x_{t-1}$ given the keywords. For each step of the diffusion process, the hidden variable $x_{t-1}$ is generated from the previous step $x_t$ in combination with the control condition (keywords); by Bayes' formula,

$p(x_{t-1} \mid x_t, c) \propto p(x_{t-1} \mid x_t)\, p(c \mid x_{t-1}, x_t)$

which, under the conditional-independence assumption, simplifies to

$p(x_{t-1} \mid x_t, c) \propto p(x_{t-1} \mid x_t)\, p(c \mid x_{t-1})$

Taking the step from $x_T$ to $x_{T-1}$ as an example: first $x_T$ is passed into the model trained in the diffusion model training process (usually a Transformer), which predicts $x_{T-1}$; $x_{T-1}$ is then input into the classifier, which predicts $p(c \mid x_{T-1})$; the value of $x_{T-1}$ is then updated through back propagation, at which point $x_{T-1}$ has been shifted one step toward the target direction. The shifted $x_{T-1}$ is again input into the Transformer to predict $x_{T-2}$, and so on, until after $T$ steps $x_0$ is obtained; $x_0$ is then mapped to text through softmax to obtain the corresponding text result. The text result obtained in this way is the result controlled by the target direction.
Thus, for the $t$-th step of the diffusion process, the value of $x_{t-1}$ can be updated by the following formula:

$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, c) = \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(c \mid x_{t-1})$

where $\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t)$ is obtained from the diffusion model, whose main function is to generate fluent text, and $\nabla_{x_{t-1}} \log p(c \mid x_{t-1})$ is obtained from a neural-network classifier, whose main function is to generate text in the direction of the control condition (keywords). In addition, to generate more fluent text, a hyperparameter $\lambda$ is added to balance the fluency of the text against its direction, so the gradient update becomes

$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, c) = \lambda\, \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(c \mid x_{t-1})$
As described above, $p(x_{t-1} \mid x_t)$ can be obtained from the diffusion model trained in the diffusion model training process, while $p(c \mid x_{t-1})$ requires a classifier to obtain the corresponding probability value. $p(c \mid x_{t-1})$ is, in fact, the probability that the hidden variable $x_{t-1}$, once given, is judged to satisfy the control condition $c$. Normally a classifier would be trained to obtain this probability, but since very many different keywords may appear, it is difficult to use all possible keywords as class labels and output a per-keyword probability for the hidden variable $x_{t-1}$. This embodiment therefore adopts an nbow model to compute the probability value, with the domain vocabularies obtained in step 1 serving as the bags of words. First, the domain or domains to which the keyword belongs are found, and the words in those domains are taken as target words semantically similar to the keyword; the probability value is then the probability that the diffusion language model currently assigns to each word in the domain word list, and these probabilities are summed and the logarithm taken:

$\log p(c \mid x_{t-1}) = \log \sum_{w_i \in V_c} p(w_i \mid x_{t-1})$

where $w_i$ is a word in the domain word list $V_c$ and $p(w_i \mid x_{t-1})$ is the probability of generating the word $w_i$ during reconstruction. From this step, the probability value of the classifier for the current hidden variable $x_{t-1}$ can therefore be obtained. The hidden variable of the diffusion model is gradient-updated using the domain word list and the probability from the diffusion model, so that the $x_{t-1}$ of the next step is a hidden vector closer to the control condition. The $x_0$ generated after $T$ diffusion steps is the final hidden variable; it is input into the approximation (rounding) model of the diffusion model to obtain the corresponding sentence text, thereby generating the corresponding sentence. $T$ is a hyperparameter and may be set to 200.
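One keyword-controlled update step implied by the formulas above can be sketched as follows. The projection matrix standing in for the rounding/softmax layer is a toy assumption, and the fluency gradient term weighted by $\lambda$ is treated as already folded into the denoiser's proposal for brevity; this is a sketch of the guidance idea, not the patented implementation.

```python
# Sketch of one keyword-controlled update: the hidden variable x_{t-1} proposed by
# the denoiser is nudged by the gradient of log p(c | x_{t-1}), computed as the log
# of the summed softmax probabilities of the words in the domain word list V_c.
import torch

def control_step(x_prev: torch.Tensor, word_logits_W: torch.Tensor,
                 domain_ids: torch.Tensor, step_size: float = 0.1):
    """x_prev: (n, d) hidden variable; word_logits_W: (vocab, d) projection to word logits."""
    x = x_prev.clone().detach().requires_grad_(True)
    logits = x @ word_logits_W.T                       # (n, vocab) word logits per position
    log_probs = torch.log_softmax(logits, dim=-1)
    # log p(c | x) = sum over positions of log( sum_{w in V_c} p(w | x_i) )
    log_p_c = torch.logsumexp(log_probs[:, domain_ids], dim=-1).sum()
    log_p_c.backward()
    # gradient ascent toward the control condition; the fluency term
    # lambda * grad log p(x_{t-1} | x_t) is assumed folded into x_prev here
    return (x + step_size * x.grad).detach()

W = torch.randn(50, 16)                                # toy vocabulary of 50 words, d = 16
x_prev = torch.randn(8, 16)
controlled = control_step(x_prev, W, domain_ids=torch.tensor([3, 7, 12]))
print(controlled.shape)
```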
Step 3, generating the claim and specification sentences with the diffusion model: the generation of step 2 is repeated with the claims diffusion model and the specification diffusion model, finally yielding abstract, claim and specification sentences related to "artificial intelligence" and "image", as shown in fig. 5.
3. Extended search and integration, as shown in FIG. 6:
Expanded retrieval: the abstract, claim and specification sentences generated by the diffusion models are respectively sent to the retrieval system and compared against the patent abstracts, claims and specifications in the patent library of the retrieval system. Taking the abstract sentence as an example, it is compared against the abstract part of the patents in the retrieval system, and the top N patents similar to the input abstract sentence are returned. The retrieval system can calculate the similarity between the sentences generated by the diffusion models and the text vectors of the abstract, claims and specification parts of each patent using a BM25 model or BERT word-vector representations, and return the top-N patents with the highest similarity.
Integration: and selecting the top K patents with the highest weight as the extended retrieval by weighting the similarity of the three acquired 3N patents. The weighting coefficient may be set as required, for example, the weighting method may be that the returned patent similarity of each part is assigned with the same weight, and then the first K patent documents are the first 3N patents which are ranked from high to low in patent similarity.
In this embodiment, the short query keywords input by the user are diffused to generate longer and more diverse sentences that correspond to the abstract, claims and specification parts of a patent, so that the retrieval system can retrieve more accurately by comparing similarity against these three parts of each patent. The patent retrieval system thus obtains more accurate information about the user's intention and can retrieve the content the user wants; combined with iterative supplementation through a user-interaction mechanism, retrieval text resembling patent abstracts is generated, and patent retrieval is realized with a multi-stage text-similarity matching and ranking algorithm. This overcomes the lack of fine-grained retrieval in the prior art, improves the accuracy of patent retrieval, and achieves the goals of freeing manpower, reducing cost and improving efficiency.
The embodiment can be realized by software, and the product form can be a computer device loaded with the corresponding software and can also be a computer readable storage medium. For example:
computer equipment, comprising a memory and a processor, wherein the memory stores a computer program, and is characterized in that the processor realizes the steps of the patent document inquiry method based on the diffusion model when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A patent document query method based on a diffusion model is characterized by comprising the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval exceeds a preset length threshold, segmenting the text content, and then respectively sending the segmentation results into three diffusion models for diffusion generation, wherein clusters of all keywords in the segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the three diffusion models are respectively marked as a first diffusion model, a second diffusion model and a third diffusion model, and the training corpora of the three diffusion models are respectively derived from the abstract, the claims and the specification and are used for correspondingly generating sentences similar in expression form to sentences of the abstract, the claims and the specification;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
performing weighted integration on the three groups of patent documents, and selecting and outputting a plurality of weighted patent documents with the highest similarity as intention retrieval results of the user;
the training method of the three diffusion models comprises the following steps: gradually adding noise into the training corpus, continuously destroying the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, wherein the process is marked as the noising process; then denoising the completely random Gaussian noise, continuously denoising with a generative model using the corrupted corpus information stored in the noising process as label data, and finally obtaining the original corpus information, so that the generative model learns the capability of generating the corresponding corpus through the denoising process;
in the three diffusion models, each diffusion model performs a diffusion generation process, which specifically includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
2. The diffusion model-based patent document query method according to claim 1, further comprising:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
3. The diffusion model-based patent document query method of claim 2, wherein the three groups of patent documents contain the same number of documents.
4. The diffusion model-based patent document query method according to claim 1, wherein the generative model is a Transformer model or a GPT model.
5. The method for querying patent documents based on diffusion model according to claim 1, wherein said generating method of training corpus comprises:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and respectively segmenting the first sentence, the second sentence and the third sentence by adopting a text word segmentation device, wherein the corresponding word segmentation result is the training corpus used for the first diffusion model, the second diffusion model and the third diffusion model.
6. The diffusion model-based patent document query method according to claim 1,
the retrieval system calculates, using a BM25 model or BERT word-vector representations, the similarity between the sentence generated by the first diffusion model and the abstract text vector of each patent document, between the sentence generated by the second diffusion model and the claims text vector of each patent document, and between the sentence generated by the third diffusion model and the specification text vector of each patent document, and returns the N patent documents with the highest similarity for each.
7. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the diffusion model based patent document query method according to any one of claims 1 to 6.
8. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the diffusion model-based patent document query method according to any one of claims 1 to 6.
CN202310048755.8A 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment Active CN115794999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Publications (2)

Publication Number Publication Date
CN115794999A CN115794999A (en) 2023-03-14
CN115794999B true CN115794999B (en) 2023-04-11

Family

ID=85429384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310048755.8A Active CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Country Status (1)

Country Link
CN (1) CN115794999B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115951883B (en) * 2023-03-15 2023-05-23 日照市德衡信息技术有限公司 Service component management system of distributed micro-service architecture and method thereof
CN116431838B (en) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116501899A (en) * 2023-06-30 2023-07-28 粤港澳大湾区数字经济研究院(福田) Event skeleton diagram generation method, system, terminal and medium based on diffusion model
CN117251539B (en) * 2023-08-11 2024-04-02 北京中知智慧科技有限公司 Patent intelligent retrieval system using generative artificial intelligence
CN117131187B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117421393B (en) * 2023-12-18 2024-04-09 知呱呱(天津)大数据技术有限公司 Generating type retrieval method and system for patent

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
CN107609142A (en) * 2017-09-21 2018-01-19 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on Extended Boolean Retrieval model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156111B (en) * 2015-04-03 2021-10-19 北京中知智慧科技有限公司 Patent document retrieval method, device and system
CN112036177A (en) * 2020-07-28 2020-12-04 中译语通科技股份有限公司 Text semantic similarity information processing method and system based on multi-model fusion
CN112507109A (en) * 2020-12-11 2021-03-16 重庆知识产权大数据研究院有限公司 Retrieval method and device based on semantic analysis and keyword recognition
CN112667800A (en) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 Keyword generation method and device, electronic equipment and computer storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
CN107609142A (en) * 2017-09-21 2018-01-19 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on Extended Boolean Retrieval model

Also Published As

Publication number Publication date
CN115794999A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN115794999B (en) Patent document query method based on diffusion model and computer equipment
Xiang et al. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography
CN111046179B (en) Text classification method for open network question in specific field
CN109800437B (en) Named entity recognition method based on feature fusion
CN110263325B (en) Chinese word segmentation system
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
Shalaby et al. An lstm approach to patent classification based on fixed hierarchy vectors
CN114428850B (en) Text retrieval matching method and system
Li et al. Chinese text classification based on hybrid model of CNN and LSTM
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN112560438A (en) Text generation method based on generation of confrontation network
CN111651602A (en) Text classification method and system
Yi et al. Exploring hierarchical graph representation for large-scale zero-shot image classification
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
Tao et al. News text classification based on an improved convolutional neural network
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
Pasad et al. On the contributions of visual and textual supervision in low-resource semantic speech retrieval
CN112925907A (en) Microblog comment viewpoint object classification method based on event graph convolutional neural network
CN116955616A (en) Text classification method and electronic equipment
Yap Text anomaly detection with arae-anogan
CN114547245A (en) Legal element-based class case retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Zhiguagua Technology Co.,Ltd.

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguquan Technology Service Co.,Ltd.

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP01 Change in the name or title of a patent holder
CP03 Change of name, title or address

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguagua Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP03 Change of name, title or address
TR01 Transfer of patent right

Effective date of registration: 20240514

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

TR01 Transfer of patent right