CN115794999B - Patent document query method based on diffusion model and computer equipment - Google Patents

Patent document query method based on diffusion model and computer equipment

Info

Publication number
CN115794999B
CN115794999B
Authority
CN
China
Prior art keywords
diffusion
model
diffusion model
retrieval
patent documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310048755.8A
Other languages
Chinese (zh)
Other versions
CN115794999A (en)
Inventor
尤元岳
徐青伟
严长春
裴非
范娥媚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xinghe Zhiyuan Technology Co ltd
Zhiguagua Tianjin Big Data Technology Co ltd
Original Assignee
Zhiguagua Tianjin Big Data Technology Co ltd
Beijing Zhiguquan Technology Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhiguagua Tianjin Big Data Technology Co ltd, Beijing Zhiguquan Technology Service Co ltd filed Critical Zhiguagua Tianjin Big Data Technology Co ltd
Priority to CN202310048755.8A priority Critical patent/CN115794999B/en
Publication of CN115794999A publication Critical patent/CN115794999A/en
Application granted granted Critical
Publication of CN115794999B publication Critical patent/CN115794999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a patent document query method based on a diffusion model and computer equipment, aiming to solve the problem that the completeness and accuracy of existing patent retrieval are not ideal. In the method, a short text input by a user is segmented into a plurality of keywords, which are respectively sent into three diffusion models for diffusion generation, where the clusters of the keywords in the word segmentation result jointly serve as control signals of the diffusion models to constrain the direction of diffusion generation. The training corpora of the three diffusion models are derived from the abstract, the claims and the specification respectively, so that each model generates sentences similar in expression form to sentences of the abstract, the claims and the specification. After retrieval, the three groups of patent documents are weighted and integrated, and the several patent documents with the highest weighted similarity are selected and output as the user's intended retrieval result. The retrieval result is therefore comprehensive and better matches the user's real retrieval intention, improving the completeness and accuracy of patent retrieval.

Description

Patent document query method based on diffusion model and computer equipment
Technical Field
The application belongs to the technical field of document retrieval, and particularly relates to a patent document query method and computer equipment.
Background
Patent retrieval is used for patent duplication checking and infringement detection and is a key core link in the process of patent application and rights maintenance; how to realize accurate and efficient retrieval has become an important part of patent system construction.
At present, common patent retrieval methods are generally realized by matching and ranking between the retrieval keyword phrase input by the user and the patent text. Especially in scenarios such as "simple retrieval" and "semantic retrieval", the retrieval keywords input by the user may be associated with multiple topics, so the short text input by the user cannot completely express the user's real retrieval intention; the limited information in the short text does not match the rich semantic content of patent documents, and the completeness and accuracy of the final patent retrieval are therefore not ideal.
Meanwhile, traditional query expansion is realized with general-domain synonym lists, word vectors and the like; however, general-domain synonyms cannot effectively capture the semantic similarity between professional terms in the patent field. These methods cannot adapt to diverse and efficient retrieval in a dynamic, previously unseen (zero-shot) patent retrieval scenario, and the retrieval text automatically generated by such query expansion does little to improve the overall accuracy of retrieval.
Disclosure of Invention
The application provides a patent document query method based on a diffusion model and computer equipment, and aims to solve the problem that the completeness and accuracy of the current patent retrieval are not ideal.
Therefore, the following technical scheme is provided in the application:
a patent document query method based on a diffusion model comprises the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval exceeds a preset length threshold, segmenting the text content and then sending the segmentation result into each of three diffusion models for diffusion generation, wherein the clusters of all keywords in the segmentation result are jointly used as control signals of the diffusion models to constrain the generation direction of the diffusion models; the three diffusion models are respectively denoted as a first diffusion model, a second diffusion model and a third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so that they correspondingly generate sentences similar in expression form to sentences of the abstract, the claims and the specification;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
and performing weighted integration on the three groups of patent documents, and selecting a plurality of weighted patent documents with the highest similarity as the intention retrieval result of the user and outputting the intention retrieval result.
Optionally, the patent document query method further includes:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
Optionally, the three groups of patent documents are of the same size; of course, their sizes may also differ.
Preferably, the training method of the three diffusion models comprises the following steps:
gradually adding noise into the training corpus to continuously destroy the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, this process being denoted as the noising process; and then denoising the completely random Gaussian noise, using the corrupted corpus information stored during the noising process as label data, and continuously denoising with a generative model until the original corpus information is finally recovered, so that the generative model learns, through the denoising process, the capability of generating the corresponding corpus.
Optionally, the generating model adopts a Transformer model or a GPT model.
Preferably, the method for generating the corpus includes:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and performing word segmentation on the first sentence, the second sentence and the third sentence by adopting a text word segmentation device respectively, wherein the corresponding word segmentation result is a training corpus used for the first diffusion model, the second diffusion model and the third diffusion model.
Preferably, in the three diffusion models, each diffusion model performs a diffusion generation process, which specifically includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
Preferably, the retrieval system calculates the similarity between the sentence generated by the first diffusion model and the abstract text vector of each patent document, between the sentence generated by the second diffusion model and the claims text vector of each patent document, and between the sentence generated by the third diffusion model and the specification text vector of each patent document, using a BM25 model or BERT word-vector representations, and returns the N patent documents with the highest similarity for each.
Computer equipment comprising a memory and a processor, wherein the memory stores a computer program, and is characterized in that the processor implements the steps of the patent document inquiry method based on the diffusion model when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The application has at least the following beneficial effects:
the method comprises the steps of obtaining a plurality of keywords through word segmentation for a short text input by a user, and respectively sending the keywords into three diffusion models for diffusion generation, wherein clusters of the keywords in word segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the training corpora of the three diffusion models are respectively derived from the abstract, the claim and the specification and are used for correspondingly generating sentences similar to the sentence expression forms of the abstract, the claim and the specification; sending the documents into a retrieval system for retrieval to obtain three groups of patent documents, performing weighted integration on the three groups of patent documents, and selecting and outputting a plurality of patent documents with the highest similarity after weighting as the intention retrieval result of the user; therefore, the retrieval result is more comprehensive and more accords with the real retrieval intention of the user, and the completeness and the accuracy of patent retrieval are improved.
Drawings
FIG. 1 is a schematic diagram illustrating a basic principle of a patent document query method based on a diffusion model according to an embodiment of the present application;
FIG. 2 is a diagram illustrating a training process of a diffusion model in an embodiment of the present application (taking a summary diffusion model as an example);
FIG. 3 is a schematic diagram of a training method for three diffusion models according to an embodiment of the present application;
FIG. 4 is a diagram illustrating a sentence generation process of a diffusion model in an embodiment of the present application (taking a summary diffusion model as an example);
FIG. 5 is a diagram illustrating a sentence generation method for three diffusion models according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an extended search and integration process according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, consider an application scenario in which a user enters a retrieval sentence on a website or APP that provides patent retrieval services, either through "simple retrieval" (typically a search box presented on the home page) or by selecting "semantic retrieval" (which supports longer text content).
As shown in fig. 1, there is provided a patent document query method based on a diffusion model, comprising the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval does not exceed the preset length threshold, sending it directly into a retrieval system (optionally after appropriate preprocessing), and retrieving patent documents with the abstract, the claims and the specification respectively as retrieval ranges to obtain three groups of patent documents;
if the text content input by the user for retrieval exceeds the preset length threshold, segmenting the text content and then sending the segmentation result into each of three diffusion models (an abstract diffusion model, a claims diffusion model and a specification diffusion model) for diffusion generation, wherein the clusters of all keywords in the segmentation result are jointly used as control signals of the diffusion models to constrain their generation direction; the three diffusion models may be abbreviated as a first diffusion model, a second diffusion model and a third diffusion model, and their training corpora are respectively derived from the abstract, the claims and the specification, so that they correspondingly generate sentences similar in expression form to sentences of the abstract, the claims and the specification; the sentences generated by the three diffusion models are then sent into the retrieval system, and patent documents are retrieved with the abstract, the claims and the specification respectively as retrieval ranges;
note that the "claim" and the "claims" mentioned here are different concepts: the former emphasizes individual claim sentences (each claim expresses an independent meaning and can generate its own text vector; for the same semantics, the expression form of a claim sentence may differ from that of sentences in the abstract and the specification), while the latter is one of the basic components of a patent document (the target range for similarity calculation);
in addition, similarity of sentence expression form is a different concept from the similarity calculation in patent document retrieval: the former focuses on the expression form of sentences, aiming to make the sentences generated by the three models look more like abstract sentences, claim sentences and specification sentences respectively, while the latter focuses on semantic closeness;
performing weighted integration on the three groups of patent documents, and selecting and outputting the K patent documents with the highest weighted similarity as the user's intended retrieval result, where K is a preset number of documents and is less than 3N.
Specifically, retrieving patent documents with the abstract, the claims and the specification respectively as retrieval ranges means calculating the similarity between the sentence generated by the first diffusion model and the abstract text vectors of the patent documents in the patent library, between the sentence generated by the second diffusion model and the claims text vectors, and between the sentence generated by the third diffusion model and the specification text vectors, and returning the N patent documents with the highest similarity for each, thereby obtaining three groups of patent documents (3N patent documents in total). Of course, the sizes of the three groups may also differ, for example: the first group may be set to 100 documents, the second group to 80 and the third group to 50, giving 230 documents in total; the three groups are then weighted and integrated, and the 150 patents with the highest weighted similarity are selected.
Here, the technology of retrieving patent documents with a specific field (abstract, claims or specification) as the retrieval range (approximate retrieval) is itself prior art in this field.
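The overall query flow described above can be illustrated with the following minimal sketch. All names here (segment_fn, the model and backend objects, weighted_top_k) are hypothetical placeholders standing in for the components described in the text, not an implementation disclosed by the patent; a concrete version of the weighted merge appears later in the extended retrieval and integration part.

```python
# Minimal sketch of the overall query flow; every helper name is a hypothetical
# placeholder for a component described in the text, not a real API.

def query_patents(user_text, length_threshold, segment_fn, models, backends,
                  weights, n_per_group, k):
    """models / backends / weights: dicts keyed by 'abstract', 'claims', 'description'."""
    groups = {}
    if len(user_text) <= length_threshold:
        # Short input: send the user text directly to each section-level index.
        for section, backend in backends.items():
            groups[section] = backend.search(user_text, top_n=n_per_group)
    else:
        # Long input: segment into keywords and expand with the three diffusion models;
        # the keyword clusters act as the control signal of each model.
        keywords = segment_fn(user_text)
        for section, model in models.items():
            expanded_sentence = model.generate(keywords)
            groups[section] = backends[section].search(expanded_sentence, top_n=n_per_group)

    # Weighted integration of the three result groups; keep the K highest-scoring documents.
    return weighted_top_k(groups, weights, k)
```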
The diffusion model is a deep generative model. Automatically generating query expansions conditioned on the input has the advantages of strong robustness, high sampling efficiency, semantic closeness and sample diversity, and the generated results have a certain interpretability. Incrementally generating query content through the diffusion model allows the retrieval results to cover the topic more completely and improves recall; since the cost of missing a related patent is high, the recall metric is very important. In addition, new words and terms appear in patents at a high rate, and the diffusion model can generate interpretable expanded queries with wide and diverse coverage, which helps the user's intuitive understanding. The diffusion model plays an important role in the whole process; in this embodiment, a diffusion model that can be controlled by keywords is trained, and this controllable diffusion model is applied to an expanded retrieval system in the patent field.
The purpose of the first part, training, is to give the diffusion model the capability of generating sentences in arbitrary text directions through the training data and training method. For example, after training on data from various fields, the diffusion model can generate sentences in many field directions, including artificial intelligence, computing, transportation and so on; however, at this stage there is no way to control the field direction of the specific sentence being generated: the model may produce a sentence in the artificial intelligence field or one in the computing field, entirely at random. The first step only gives the model the capability of generating sentences across all fields. In this step, in order to make the results of the expanded patent retrieval more accurate, three models are trained respectively with corpora from the abstract, the claims and the specification, so that the sentences generated by the three models are more like sentences from the abstract, the claims and the specification.
The method for training the diffusion model in the embodiment mainly comprises the following steps:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, wherein the sentences can be marked as a first sentence, a second sentence and a third sentence; respectively segmenting words of the first sentence, the second sentence and the third sentence by adopting a text word segmentation device, wherein the corresponding word segmentation result is training corpus used for the first diffusion model, the second diffusion model and the third diffusion model;
gradually adding noise into the corresponding training corpus of each diffusion model to continuously destroy the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, this process being denoted as the noising process; and then denoising the completely random Gaussian noise, using the corrupted corpus information stored during the noising process as label data, and continuously denoising with a generative model until the original corpus information is finally recovered, so that the model learns, through the denoising process, the capability of generating the corresponding corpus.
The second part, generation, constrains the generation direction of the diffusion model through the keywords input by the user, so that the diffusion model generates sentences in a fixed text direction. For example, if the user wants an expanded retrieval of sentences in the artificial intelligence field and therefore inputs the keyword "artificial intelligence", then in the generation process the model gradually shifts the direction of the sentence it is generating toward the artificial intelligence field according to the keyword input by the user, and finally generates a sentence in that field. Such sentences can serve as retrieval conditions with richer content, expanding the query beyond the keyword "artificial intelligence" input by the user. Based on the models trained in the first step on the abstract, the claims and the specification, the keyword "artificial intelligence" input by the user is expanded, and the three diffusion models generate abstract, claims and specification sentences in the artificial intelligence direction.
The specific process of the diffusion generation by the diffusion model in this embodiment mainly includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords; searching for the domain word list containing each keyword (each domain word list is generated in advance based on a clustering algorithm); taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
The third part, retrieval, uses the abstract, claims and specification sentences in the artificial intelligence direction obtained from the generation part to retrieve against the abstract, claims and specification of the patent documents in the retrieval system respectively. That is, the abstract sentences generated by the diffusion model in the artificial intelligence direction are compared against the abstract information of the patent documents in the retrieval system, and the retrieval system returns the top N patents based on the similarity between the abstracts of the patent documents and the sentences generated by the diffusion model. Similarly, the claim sentences and the specification sentences are used to retrieve against the claims and the specification of the patent documents, each also returning the top N patents by similarity. Weighted statistics are then performed on these 3N patents to find the top K patents most relevant to the "artificial intelligence" keyword input by the user, and the retrieval system returns these K patents as the result.
Thus, the whole process can be summarized in the following three steps:
1. The diffusion model training process, in which the diffusion model acquires the capability of generating sentences in arbitrary directions; the diffusion models are trained separately on the corpora of the different patent parts, so that each diffusion model can generate sentences of the corresponding patent part. This process is not part of the expanded retrieval flow itself but is a precondition for the expanded retrieval function.
2. The sentence generation process of the diffusion model, which is part of the expanded retrieval flow: the diffusion model gradually generates sentences in the direction of the keywords input by the user, and the diffusion models trained for the different patent parts respectively generate abstract, claims and specification sentences in the keyword direction.
3. The expanded retrieval and integration process, in which the generated sentences of the three parts in the same field are passed to the retrieval system to retrieve against the three parts of the patents respectively, and weighted statistics over the retrieval results identify the most similar top K patents, completing the expanded retrieval.
These three steps are described in further detail below.
1. The diffusion model training process aims to give the diffusion model the capability of generating sentences in arbitrary domain directions. The diffusion model training step comprises: corpus construction, abstract diffusion model training, claims diffusion model training and specification diffusion model training.
Step 1, corpus construction: because the expanded retrieval targets the patent field and the final goal is for the models to generate sentences related to the abstract, the claims and the specification, the abstract, claims and specification text of each collected patent document is extracted and split by periods, semicolons and the like, and the split sentences are used as the preliminary training corpora of the three diffusion models. Three different diffusion models are prepared and trained with these three parts of the training corpora.
To avoid repeated explanation, the following process takes the abstract sentence corpus as an example and trains the diffusion model for the abstract; the diffusion models for the claims and the specification must likewise be trained as a precondition for the subsequent generation process.
Step 2, abstract diffusion model training: as shown in fig. 2, the overall idea of the training process is to gradually add noise to the collected abstract corpus, continuously destroying the information of the whole corpus and storing the information of each destruction step until the corpus information is destroyed into completely random Gaussian noise. This process is called forward propagation, i.e. the noising process. After the noising process, random Gaussian noise is obtained; this Gaussian noise then needs to be denoised continuously, using the corrupted corpus information stored during the noising process as label data, with a generative model such as a Transformer or GPT model, until the original corpus information is finally recovered, so that the generative model can learn the capability of generating the corresponding corpus through the denoising process. The specific training process, as shown in fig. 2, is as follows:
(1) The abstract sentences acquired by the corpus construction part are used as input data of the diffusion model; in this example the abstract sentence "an artificial intelligent automobile, which comprises an automatic route searching method and a danger predicting module" is used as the input text. The input text is then segmented; the text word segmenter can be pre-trained or an existing trained segmenter (for example the jieba segmenter) can be used directly. The obtained word segmentation result is w, where w is the word list of the segmented input sentence. Assuming the input sentence contains n words after segmentation, the input data is divided into the words

$w = (w_1, w_2, \ldots, w_n)$

The word segmentation result in this example is the corresponding word sequence of the above sentence, for example (artificial intelligent, automobile, automatic route searching, method, danger predicting, module).
(2) The word segmentation result w is passed into a word vector embedding layer EMB, which maps the discrete words into a continuous space; the resulting word embedding is

$\mathrm{EMB}(w) = \big(\mathrm{EMB}(w_1), \mathrm{EMB}(w_2), \ldots, \mathrm{EMB}(w_n)\big) \in \mathbb{R}^{n \times d}$

i.e. the n words are mapped into n d-dimensional vectors.
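The mapping of a segmented sentence to n d-dimensional vectors in steps (1)-(2) can be illustrated with the following toy sketch; the whitespace split and the randomly initialized embedding table are stand-ins for the trained Chinese segmenter and the learned embedding layer EMB of the embodiment.

```python
# Toy sketch of steps (1)-(2): segment a sentence into words w = (w_1, ..., w_n)
# and map them to n d-dimensional vectors with an embedding layer EMB.
import torch
import torch.nn as nn

sentence = "an artificial intelligent automobile with an automatic route searching method"
words = sentence.split()                      # stand-in for the text word segmenter
vocab = {w: i for i, w in enumerate(sorted(set(words)))}

d = 16                                        # embedding dimension
emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d)

ids = torch.tensor([vocab[w] for w in words])
x = emb(ids)                                  # shape (n, d): n words -> n d-dim vectors
print(x.shape)
```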
(3) The embedded word vectors are then transformed through a Markov chain into the hidden variables of the diffusion model: the probabilistic model

$q_\phi(x_0 \mid w) = \mathcal{N}\big(x_0;\ \mathrm{EMB}(w),\ \sigma_0^2 I\big)$

generates the corresponding hidden variable $x_0$, where $q_\phi(x_0 \mid w)$ is the probability that, given $w$, the hidden variable $x_0$ is generated via word-vector encoding and the Markov chain; it is a normal distribution with mean $\mathrm{EMB}(w)$ and variance $\sigma_0^2 I$, and the value of $x_0$ is sampled from this normal distribution. In the reverse process, a trainable rounding (approximation) step is added that maps $x_0$ back to the original segmented text, with the mapping relation

$p_\theta(w \mid x_0) = \prod_{i=1}^{n} p(w_i \mid x_i)$

where $p(w_i \mid x_i)$ is a softmax distribution whose meaning is the probability, given $x_i$, that the softmax distribution yields the word $w_i$. In the following, for convenience of understanding, $q$ is used as the probabilistic representation of feed-forward propagation (the noising process) and $p_\theta$ as the probabilistic representation of the reverse denoising process.
(4) During feed-forward propagation, the intermediate hidden variables $x_1, x_2, \ldots, x_T$ are constructed: the feed-forward propagation progressively adds Gaussian noise to $x_{t-1}$ until step $T$ is reached, at which point $x_T$ is close to pure Gaussian noise. Each transition from $x_{t-1}$ to $x_t$ is sampled from the normal distribution

$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$

where $\beta_t$, the amount of Gaussian noise added at step $t$, is a hyperparameter. The feed-forward process $q$ contains no trainable parameters and defines the training objective: noisy data are generated according to the predefined feed-forward process $q$, and a model is trained to reverse the process and reconstruct the data.
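A minimal sketch of this noising process follows, under the assumption of a simple linear $\beta_t$ schedule (the actual schedule is not specified in the text):

```python
# Sketch of the feed-forward (noising) process of step (4): starting from the
# embedded corpus x_0, Gaussian noise is added for T steps according to a
# predefined schedule beta_t, and the intermediate x_t are stored as labels
# for the denoising model. The linear beta schedule is an illustrative assumption.
import torch

def forward_noising(x0: torch.Tensor, T: int = 200):
    betas = torch.linspace(1e-4, 0.02, T)           # beta_t: noise added at step t (hyperparameter)
    xs = [x0]
    x = x0
    for t in range(T):
        mean = torch.sqrt(1.0 - betas[t]) * x        # q(x_t | x_{t-1}) = N(sqrt(1-beta_t) x_{t-1}, beta_t I)
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)
        xs.append(x)                                 # stored corrupted corpus used as label data
    return xs                                        # xs[T] is close to pure Gaussian noise

x0 = torch.randn(8, 16)                              # toy embedded sentence (n=8 words, d=16)
trajectory = forward_noising(x0)
print(len(trajectory), trajectory[-1].std())
```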
(5) In the reverse process, the diffusion model gradually denoises the Gaussian noise through the reverse transition $p_\theta(x_{t-1} \mid x_t)$ and progressively reconstructs $x_0$. The whole process is that, starting from the Gaussian noise $x_T$, the model denoises step by step, generating a series of hidden variables $x_{T-1}, \ldots, x_1$ and finally a sample $x_0$ close to the target distribution. The initial state is $p(x_T) = \mathcal{N}(x_T;\ 0,\ I)$, and each denoising step from $x_t$ to $x_{t-1}$ is obtained from

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$

where $\mu_\theta$ and $\Sigma_\theta$ can be computed and learned by a generative model such as a Transformer or GPT model. The data for this training process are taken from the forward diffusion process: the input is $x_t$ and the target output is $x_{t-1}$, and the model is used to learn the mean and variance of the feed-forward distribution. Taking one denoising step as an example, the model receives $x_t$ as input and outputs the $x_{t-1}$ predicted by the denoising process; this prediction should approach the $x_{t-1}$ stored during feed-forward propagation, so the discrepancy between the feed-forward $x_{t-1}$ and the denoised $x_{t-1}$ is taken as the loss, and the parameters of $\mu_\theta$ and $\Sigma_\theta$ are updated by back propagation, so that the model learns the mean and variance of the current distribution.
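The following toy sketch illustrates one such reverse step; the small feed-forward network stands in for the Transformer or GPT model that predicts $\mu_\theta$, and the variance is fixed to $\beta_t$ for simplicity rather than learned as $\Sigma_\theta$:

```python
# Sketch of one reverse (denoising) step of step (5).
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts mu_theta(x_t, t); stands in for the Transformer/GPT of the text."""
    def __init__(self, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d + 1, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x_t: torch.Tensor, t: int) -> torch.Tensor:
        t_feat = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def reverse_step(model: Denoiser, x_t: torch.Tensor, t: int, beta_t: float) -> torch.Tensor:
    # p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), Sigma_theta); sample one x_{t-1}
    mu = model(x_t, t)
    return mu + beta_t ** 0.5 * torch.randn_like(x_t)

model = Denoiser(d=16)
x_T = torch.randn(8, 16)                     # start from Gaussian noise
x_prev = reverse_step(model, x_T, t=200, beta_t=0.02)
print(x_prev.shape)
```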
(6) The diffusion model is trained by maximizing the likelihood of the data $x_0$, and the canonical training objective is the variational lower bound of $\log p_\theta(x_0)$, so the loss function of the diffusion model becomes:

$L_{\mathrm{vlb}}(x_0) = \mathbb{E}_{q(x_{1:T} \mid x_0)}\Big[ \log \tfrac{q(x_T \mid x_0)}{p_\theta(x_T)} + \sum_{t=2}^{T} \log \tfrac{q(x_{t-1} \mid x_0, x_t)}{p_\theta(x_{t-1} \mid x_t)} - \log p_\theta(x_0 \mid x_1) \Big]$

However, this training objective is not stable and requires a great deal of optimization skill, so a simpler alternative objective is devised: $L_{\mathrm{vlb}}$ is expanded and re-weighted into a mean-squared-error loss, and the loss function of the diffusion model becomes

$L_{\mathrm{simple}}(x_0) = \sum_{t=1}^{T} \mathbb{E}_{q(x_t \mid x_0)} \big\| \mu_\theta(x_t, t) - \hat{\mu}(x_t, x_0) \big\|^2$

where $\hat{\mu}(x_t, x_0)$ is the mean of the posterior $q(x_{t-1} \mid x_0, x_t)$, which is close to Gaussian, and $\mu_\theta(x_t, t)$ is the mean of $p_\theta(x_{t-1} \mid x_t)$ predicted by the neural network.
(7) Bringing the mapping of the word vectors into the hidden variable $x_0$ in step (3), i.e. $q_\phi(x_0 \mid w)$, and the process of re-mapping the reconstructed $x_0$ back to words, i.e. $p_\theta(w \mid x_0)$, into the loss function of step (6) finally yields the end-to-end training loss function:

$L_{\mathrm{vlb}}^{\mathrm{e2e}}(w) = \mathbb{E}_{q_\phi(x_0 \mid w)} \big[ L_{\mathrm{vlb}}(x_0) + \log q_\phi(x_0 \mid w) - \log p_\theta(w \mid x_0) \big]$

It can also be optimized as:

$L_{\mathrm{simple}}^{\mathrm{e2e}}(w) = \mathbb{E}_{q_\phi(x_0 \mid w)} \big[ L_{\mathrm{simple}}(x_0) + \| \mathrm{EMB}(w) - \mu_\theta(x_1, 1) \|^2 - \log p_\theta(w \mid x_0) \big]$

The two training loss functions are equivalent in principle. The diffusion model is trained with this loss function and back propagation, which completes the training of a single diffusion model.
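Putting the noising process and the denoiser together, a simplified training step consistent with the description above can be sketched as follows (reusing forward_noising and Denoiser from the sketches above). Regressing onto the stored $x_{t-1}$ is used here as a simple stand-in for the re-weighted objective $L_{\mathrm{simple}}$, and the optimizer choice is an illustrative assumption:

```python
# Simplified training step: noise the embedded corpus, then train the denoiser so
# that mu_theta(x_t, t) matches the stored x_{t-1} from the noising trajectory.
import random
import torch

def train_step(model, optimizer, x0, T=200):
    trajectory = forward_noising(x0, T)              # from the noising sketch above
    t = random.randint(1, T)                         # sample a diffusion step
    x_t, x_tm1 = trajectory[t], trajectory[t - 1]
    pred = model(x_t, t)                             # mu_theta(x_t, t)
    loss = ((pred - x_tm1) ** 2).mean()              # || mu_theta - x_{t-1} ||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = Denoiser(d=16)                               # from the denoising sketch above
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    print(train_step(model, opt, torch.randn(8, 16)))
```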
Step 3, in order to improve retrieval accuracy, the claims diffusion model and the specification diffusion model are trained following the same processes (1) to (7) of step 2, with the corpus information corresponding to sentences of the claims and the specification; three diffusion models are thus obtained, as shown in fig. 3.
2. The sentence generation process of the diffusion model. The purpose of this process is to perform, for the keywords input by the user, diffusion generation in the keyword direction with each of the three trained diffusion models, so as to generate sentences that correspond to the three parts of a patent and belong to the keyword's field direction. The sentence generation process can be divided into: diffusion model preparation, generating abstract sentences with the diffusion model, generating claim sentences with the diffusion model, and generating specification sentences with the diffusion model.
Step 1, diffusion model preparation: first, a domain vocabulary is pre-trained, or a pre-trained domain vocabulary is used; for example, the artificial intelligence domain may contain keywords such as "artificial intelligence" and "neural network". Each individual domain vocabulary is then regarded as a bag of words for the nbow model. The domain word lists can be built by segmenting, deduplicating and removing stop words from the text content of all Chinese patents and encoding each word into a vector; the vectors can be produced with an existing word-vector library or with a model such as BERT. The encoded word vectors of the words contained in all Chinese patents are then clustered, e.g. with KNN or Kmeans, to obtain the clusters (the "cluster words" in fig. 4), and each cluster word list obtained by clustering is regarded as a domain word list.
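A minimal sketch of building domain word lists by clustering word vectors follows (KMeans here; the random vectors stand in for embeddings from an existing word-vector library or BERT):

```python
# Sketch of step 1: encode the deduplicated patent vocabulary into word vectors
# and cluster them so that each cluster becomes one domain word list (nbow bag of words).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab = ["artificial intelligence", "neural network", "image", "convolution",
         "engine", "gearbox", "brake", "battery"]
vectors = rng.normal(size=(len(vocab), 32))          # toy word vectors

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
domain_word_lists = {}
for word, label in zip(vocab, kmeans.labels_):
    domain_word_lists.setdefault(int(label), []).append(word)
print(domain_word_lists)                              # each cluster = one domain word list
```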
Step 2, generating abstract sentences with the diffusion model. The main purpose of this step is to use the phrase input by the user as control information so that the abstract diffusion model generates sentences related to the direction of that phrase. The first input is the keyword phrase input by the user, which serves as the control information that steers the diffusion model toward generating sentence text in the keyword direction; the second input is random Gaussian noise, from which the diffusion model starts and which it continuously denoises in order to generate a fluent sentence.
The overall flow of the generation stage is shown in fig. 4. First, the phrase text input by the user is taken, for example "artificial intelligence image": the user probably wants to retrieve content in the artificial intelligence direction concerning image recognition or image processing methods. After word segmentation and stop-word removal, the two keywords "artificial intelligence" and "image" are obtained; each keyword is looked up among the clusters to find the clusters that contain it. As shown in fig. 4, cluster 1 contains "artificial intelligence" and cluster 2 contains "image". Cluster 1 and cluster 2 are used as the control signals of the diffusion model, steering it to generate words from these two clusters. The control process trains a classifier whose classes are the cluster classes: the denoising result of each step of the diffusion model is predicted by the classifier, the deviation between this prediction and the clusters of the keywords input by the user forms a loss, and the hidden variable of the current step of the abstract diffusion model is modified through back propagation by gradient updates, so that the updated hidden variable is biased further toward the directions of cluster 1 and cluster 2. However, a single diffusion step with back propagation does not immediately produce a fluent sentence highly related to cluster 1 and cluster 2, so this process must be repeated for multiple steps, gradually shifting the hidden variable toward cluster 1 and cluster 2 while generating a fluent sentence. The number of steps is a hyperparameter, set to 200 in this embodiment. After 200 steps of diffusion generation and direction shifting, the generated hidden variable is mapped to text with a softmax function to obtain the generated sentence. The generated sentence is the output of this step: the abstract diffusion model has successfully generated a sentence that is related to "artificial intelligence" and "image" and resembles an abstract. The specific implementation and formula logic are as follows.
The generation stage of the diffusion model starts from Gaussian noise $x_T$, denoises it step by step to generate the hidden variable $x_0$ of a fluent sentence, and then maps $x_0$ back to a text sentence with the rounding (approximation) model of step (3) of the diffusion model training process described previously; this is the generation process of a general diffusion model. However, it can also be seen that the whole process starts from Gaussian noise and produces an uncontrolled, randomly generated sentence, so a general diffusion model has no way to control the direction of the generated sentence. This embodiment therefore controls the diffusion model to generate sentences in the keyword direction, i.e. it controls the generation direction of the hidden variables in the diffusion model, that is, the value of the hidden variable $x_{t-1}$. In this embodiment the control is realized through the domain vocabulary, namely the cluster vocabulary generated in step 1, so the control process can be represented by the probability

$p(x_{t-1} \mid x_t, c)$

where $c$ denotes the control condition, i.e. the keywords, and the probability represents generating the hidden variable $x_{t-1}$ given the keywords. For each step of the diffusion process, the hidden variable $x_{t-1}$ is generated from the previous step $x_t$ in combination with the control condition (keywords); by Bayes' formula,

$p(x_{t-1} \mid x_t, c) \propto p(x_{t-1} \mid x_t)\, p(c \mid x_{t-1}, x_t)$

which, under the conditional-independence assumption, simplifies to

$p(x_{t-1} \mid x_t, c) \propto p(x_{t-1} \mid x_t)\, p(c \mid x_{t-1})$

Taking the step from $x_T$ to $x_{T-1}$ as an example: first $x_T$ is passed into the model trained in the diffusion model training process (usually a Transformer), which predicts $x_{T-1}$; $x_{T-1}$ is then input into the classifier, which predicts $p(c \mid x_{T-1})$; the value of $x_{T-1}$ is then updated through back propagation, at which point $x_{T-1}$ has been shifted one step toward the target direction. The shifted $x_{T-1}$ is again input into the Transformer to predict $x_{T-2}$, and so on, until after $T$ steps $x_0$ is obtained; $x_0$ is then mapped to text through softmax to obtain the corresponding text result. The text result obtained in this way is the result controlled by the target direction.
Thus, for the $t$-th step of the diffusion process, the value of $x_{t-1}$ can be updated by the following formula:

$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, c) = \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(c \mid x_{t-1})$

where $\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t)$ is obtained from the diffusion model, whose main function is to generate fluent text, and $\nabla_{x_{t-1}} \log p(c \mid x_{t-1})$ is obtained from a neural-network classifier, whose main function is to generate text in the direction of the control condition (keywords). In addition, to generate more fluent text, a hyperparameter $\lambda$ is added to balance the fluency of the text against its direction, so the gradient update becomes

$\nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t, c) = \lambda\, \nabla_{x_{t-1}} \log p(x_{t-1} \mid x_t) + \nabla_{x_{t-1}} \log p(c \mid x_{t-1})$
As described above, $p(x_{t-1} \mid x_t)$ can be obtained from the diffusion model trained in the diffusion model training process, while $p(c \mid x_{t-1})$ requires a classifier to obtain the corresponding probability value. $p(c \mid x_{t-1})$ is, in fact, the probability that the hidden variable $x_{t-1}$, once given, is judged to satisfy the control condition $c$. Normally a classifier would be trained to obtain this probability, but since very many different keywords may appear, it is difficult to use all possible keywords as class labels and output a per-keyword probability for the hidden variable $x_{t-1}$. This embodiment therefore adopts an nbow model to compute the probability value, with the domain vocabularies obtained in step 1 serving as the bags of words. First, the domain or domains to which the keyword belongs are found, and the words in those domains are taken as target words semantically similar to the keyword; the probability value is then the probability that the diffusion language model currently assigns to each word in the domain word list, and these probabilities are summed and the logarithm taken:

$\log p(c \mid x_{t-1}) = \log \sum_{w_i \in V_c} p(w_i \mid x_{t-1})$

where $w_i$ is a word in the domain word list $V_c$ and $p(w_i \mid x_{t-1})$ is the probability of generating the word $w_i$ during reconstruction. From this step, the probability value of the classifier for the current hidden variable $x_{t-1}$ can therefore be obtained. The hidden variable of the diffusion model is gradient-updated using the domain word list and the probability from the diffusion model, so that the $x_{t-1}$ of the next step is a hidden vector closer to the control condition. The $x_0$ generated after $T$ diffusion steps is the final hidden variable; it is input into the approximation (rounding) model of the diffusion model to obtain the corresponding sentence text, thereby generating the corresponding sentence. $T$ is a hyperparameter and may be set to 200.
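One keyword-controlled update step implied by the formulas above can be sketched as follows. The projection matrix standing in for the rounding/softmax layer is a toy assumption, and the fluency gradient term weighted by $\lambda$ is treated as already folded into the denoiser's proposal for brevity; this is a sketch of the guidance idea, not the patented implementation.

```python
# Sketch of one keyword-controlled update: the hidden variable x_{t-1} proposed by
# the denoiser is nudged by the gradient of log p(c | x_{t-1}), computed as the log
# of the summed softmax probabilities of the words in the domain word list V_c.
import torch

def control_step(x_prev: torch.Tensor, word_logits_W: torch.Tensor,
                 domain_ids: torch.Tensor, step_size: float = 0.1):
    """x_prev: (n, d) hidden variable; word_logits_W: (vocab, d) projection to word logits."""
    x = x_prev.clone().detach().requires_grad_(True)
    logits = x @ word_logits_W.T                       # (n, vocab) word logits per position
    log_probs = torch.log_softmax(logits, dim=-1)
    # log p(c | x) = sum over positions of log( sum_{w in V_c} p(w | x_i) )
    log_p_c = torch.logsumexp(log_probs[:, domain_ids], dim=-1).sum()
    log_p_c.backward()
    # gradient ascent toward the control condition; the fluency term
    # lambda * grad log p(x_{t-1} | x_t) is assumed folded into x_prev here
    return (x + step_size * x.grad).detach()

W = torch.randn(50, 16)                                # toy vocabulary of 50 words, d = 16
x_prev = torch.randn(8, 16)
controlled = control_step(x_prev, W, domain_ids=torch.tensor([3, 7, 12]))
print(controlled.shape)
```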
Step 3, generating the claim and specification sentences with the diffusion model: the generation of step 2 is repeated with the claims diffusion model and the specification diffusion model, finally yielding abstract, claim and specification sentences related to "artificial intelligence" and "image", as shown in fig. 5.
3. Extended search and integration, as shown in FIG. 6:
Expanded retrieval: the abstract, claim and specification sentences generated by the diffusion models are respectively sent to the retrieval system and compared against the patent abstracts, claims and specifications in the patent library of the retrieval system. Taking the abstract sentence as an example, it is compared against the abstract part of the patents in the retrieval system, and the top N patents similar to the input abstract sentence are returned. The retrieval system can calculate the similarity between the sentences generated by the diffusion models and the text vectors of the abstract, claims and specification parts of each patent using a BM25 model or BERT word-vector representations, and return the top-N patents with the highest similarity.
Integration: and selecting the top K patents with the highest weight as the extended retrieval by weighting the similarity of the three acquired 3N patents. The weighting coefficient may be set as required, for example, the weighting method may be that the returned patent similarity of each part is assigned with the same weight, and then the first K patent documents are the first 3N patents which are ranked from high to low in patent similarity.
In this embodiment, the short query keywords input by the user are diffused to generate longer and more diverse sentences that correspond to the abstract, claims and specification parts of a patent, so that the retrieval system can retrieve more accurately by comparing similarity against these three parts of each patent. The patent retrieval system thus obtains more accurate information about the user's intention and can retrieve the content the user wants; combined with iterative supplementation through a user-interaction mechanism, retrieval text resembling patent abstracts is generated, and patent retrieval is realized with a multi-stage text-similarity matching and ranking algorithm. This overcomes the lack of fine-grained retrieval in the prior art, improves the accuracy of patent retrieval, and achieves the goals of freeing manpower, reducing cost and improving efficiency.
The embodiment can be realized by software, and the product form can be a computer device loaded with the corresponding software and can also be a computer readable storage medium. For example:
computer equipment, comprising a memory and a processor, wherein the memory stores a computer program, and is characterized in that the processor realizes the steps of the patent document inquiry method based on the diffusion model when executing the computer program.
A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the above-mentioned diffusion model-based patent document query method.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A patent document query method based on a diffusion model is characterized by comprising the following steps:
receiving text content input by a user;
if the text content input by the user for retrieval exceeds a preset length threshold, segmenting the text content, and then respectively sending the segmentation results into three diffusion models for diffusion generation, wherein clusters of all keywords in the segmentation results are jointly used as control signals of the diffusion models to limit the generation direction of the diffusion models; the three diffusion models are respectively marked as a first diffusion model, a second diffusion model and a third diffusion model, and the training corpora of the three diffusion models are respectively derived from the abstract, the claims and the specification and are used for correspondingly generating sentences similar in expression form to sentences of the abstract, the claims and the specification;
the sentences generated by the three diffusion models are sent into a retrieval system, and the abstract, the claim and the specification are respectively and correspondingly taken as retrieval ranges to retrieve patent documents, so that three groups of patent documents are obtained;
performing weighted integration on the three groups of patent documents, and selecting and outputting a plurality of weighted patent documents with the highest similarity as intention retrieval results of the user;
the training method of the three diffusion models comprises the following steps: gradually adding noise into the training corpus, continuously destroying the corpus information, and storing the corpus information at each step of the destruction process until the original corpus information is destroyed into completely random Gaussian noise, wherein the process is marked as the noising process; then denoising the completely random Gaussian noise, continuously denoising with a generative model using the corrupted corpus information stored in the noising process as label data, and finally obtaining the original corpus information, so that the generative model learns the capability of generating the corresponding corpus through the denoising process;
in the three diffusion models, each diffusion model performs a diffusion generation process, which specifically includes:
performing word segmentation and stop-word removal on the text content input by the user to obtain a plurality of keywords;
respectively searching a domain word list containing each keyword; the field word list is generated in advance based on a clustering algorithm;
and taking the other words in the domain word list to which each keyword belongs as target words semantically similar to the keyword, training a classifier whose classes correspond to the domain word lists to obtain the probability assigned by the diffusion model to each target word in the domain word list, then performing gradient updates on the hidden variables of the diffusion model, repeating the diffusion for multiple steps, and finally mapping the generated hidden variables to text through a softmax function to obtain sentences in the direction controlled by the keywords.
2. The diffusion model-based patent document query method according to claim 1, further comprising:
if the text content input by the user search does not exceed the preset length threshold, the text content input by the user search is directly sent to a search system, and patent documents are searched respectively by taking the abstract, the claim and the specification as search ranges to obtain three groups of patent documents.
3. The diffusion model-based patent document query method of claim 2, wherein the three groups of patent documents contain the same number of documents.
4. The diffusion model-based patent document query method according to claim 1, wherein the generative model is a Transformer model or a GPT model.
5. The method for querying patent documents based on diffusion model according to claim 1, wherein said generating method of training corpus comprises:
extracting sentences from the abstract, the claim and the specification of the published patent document respectively, and recording the sentences as a first sentence, a second sentence and a third sentence;
and respectively segmenting the first sentence, the second sentence and the third sentence by adopting a text word segmentation device, wherein the corresponding word segmentation result is the training corpus used for the first diffusion model, the second diffusion model and the third diffusion model.
6. The diffusion model-based patent document query method according to claim 1,
the retrieval system calculates, using a BM25 model or BERT word-vector representations, the similarity between the sentence generated by the first diffusion model and the abstract text vector of each patent document, between the sentence generated by the second diffusion model and the claims text vector of each patent document, and between the sentence generated by the third diffusion model and the specification text vector of each patent document, and returns the N patent documents with the highest similarity for each.
7. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the diffusion model based patent document query method according to any one of claims 1 to 6.
8. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the diffusion model-based patent document query method according to any one of claims 1 to 6.
CN202310048755.8A 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment Active CN115794999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310048755.8A CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Publications (2)

Publication Number Publication Date
CN115794999A CN115794999A (en) 2023-03-14
CN115794999B true CN115794999B (en) 2023-04-11

Family

ID=85429384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310048755.8A Active CN115794999B (en) 2023-02-01 2023-02-01 Patent document query method based on diffusion model and computer equipment

Country Status (1)

Country Link
CN (1) CN115794999B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115951883B (en) * 2023-03-15 2023-05-23 日照市德衡信息技术有限公司 Service component management system of distributed micro-service architecture and method thereof
CN116431838B (en) * 2023-06-15 2024-01-30 北京墨丘科技有限公司 Document retrieval method, device, system and storage medium
CN116501899A (en) * 2023-06-30 2023-07-28 粤港澳大湾区数字经济研究院(福田) Event skeleton diagram generation method, system, terminal and medium based on diffusion model
CN117251539B (en) * 2023-08-11 2024-04-02 北京中知智慧科技有限公司 Patent intelligent retrieval system using generative artificial intelligence
CN117131187B (en) * 2023-10-26 2024-02-09 中国科学技术大学 Dialogue abstracting method based on noise binding diffusion model
CN117421393B (en) * 2023-12-18 2024-04-09 知呱呱(天津)大数据技术有限公司 Generating type retrieval method and system for patent

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
CN107609142A (en) * 2017-09-21 2018-01-19 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on Extended Boolean Retrieval model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156111B (en) * 2015-04-03 2021-10-19 北京中知智慧科技有限公司 Patent document retrieval method, device and system
CN112036177A (en) * 2020-07-28 2020-12-04 中译语通科技股份有限公司 Text semantic similarity information processing method and system based on multi-model fusion
CN112507109A (en) * 2020-12-11 2021-03-16 重庆知识产权大数据研究院有限公司 Retrieval method and device based on semantic analysis and keyword recognition
CN112667800A (en) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 Keyword generation method and device, electronic equipment and computer storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104765779A (en) * 2015-03-20 2015-07-08 浙江大学 Patent document inquiry extension method based on YAGO2s
CN107609142A (en) * 2017-09-21 2018-01-19 合肥集知网知识产权运营有限公司 A kind of big data patent retrieval method based on Extended Boolean Retrieval model

Also Published As

Publication number Publication date
CN115794999A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
CN115794999B (en) Patent document query method based on diffusion model and computer equipment
Xiang et al. A convolutional neural network-based linguistic steganalysis for synonym substitution steganography
CN111046179B (en) Text classification method for open network question in specific field
CN109800437B (en) Named entity recognition method based on feature fusion
CN110263325B (en) Chinese word segmentation system
CN109858041B (en) Named entity recognition method combining semi-supervised learning with user-defined dictionary
CN107832306A (en) A kind of similar entities method for digging based on Doc2vec
CN113626589B (en) Multi-label text classification method based on mixed attention mechanism
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
Shalaby et al. An lstm approach to patent classification based on fixed hierarchy vectors
CN114428850B (en) Text retrieval matching method and system
Li et al. Chinese text classification based on hybrid model of CNN and LSTM
CN114781375A (en) Military equipment relation extraction method based on BERT and attention mechanism
CN112560438A (en) Text generation method based on generation of confrontation network
CN111651602A (en) Text classification method and system
Yi et al. Exploring hierarchical graph representation for large-scale zero-shot image classification
CN113962228A (en) Long document retrieval method based on semantic fusion of memory network
Tao et al. News text classification based on an improved convolutional neural network
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
Pasad et al. On the contributions of visual and textual supervision in low-resource semantic speech retrieval
CN112925907A (en) Microblog comment viewpoint object classification method based on event graph convolutional neural network
CN116955616A (en) Text classification method and electronic equipment
Yap Text anomaly detection with arae-anogan
CN114547245A (en) Legal element-based class case retrieval method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Zhiguagua Technology Co.,Ltd.

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguquan Technology Service Co.,Ltd.

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP01 Change in the name or title of a patent holder
CP03 Change of name, title or address

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Patentee after: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Zhiguagua Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

CP03 Change of name, title or address
TR01 Transfer of patent right

Effective date of registration: 20240514

Address after: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee after: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region after: China

Address before: No. 401-1, 4th floor, podium, building 3 and 4, No. 11, Changchun Bridge Road, Haidian District, Beijing 100089

Patentee before: Beijing Xinghe Zhiyuan Technology Co.,Ltd.

Country or region before: China

Patentee before: Zhiguagua (Tianjin) Big Data Technology Co.,Ltd.

TR01 Transfer of patent right