CN117312513A - Document search model training method, document search method and related device - Google Patents


Info

Publication number
CN117312513A
CN117312513A (application CN202311260327.8A)
Authority
CN
China
Prior art keywords
document
documents
sentence
search
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311260327.8A
Other languages
Chinese (zh)
Other versions
CN117312513B (en)
Inventor
甘兵 (Gan Bing)
张茂华 (Zhang Maohua)
廖瑞毅 (Liao Ruiyi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Digital Guangdong Network Construction Co Ltd
Original Assignee
Digital Guangdong Network Construction Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Digital Guangdong Network Construction Co Ltd filed Critical Digital Guangdong Network Construction Co Ltd
Priority to CN202311260327.8A priority Critical patent/CN117312513B/en
Publication of CN117312513A publication Critical patent/CN117312513A/en
Application granted granted Critical
Publication of CN117312513B publication Critical patent/CN117312513B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F16/3322 Query formulation using system suggestions
    • G06F16/316 Indexing structures
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/30 Handling natural language data; Semantic analysis
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document search model training method, a document search method, and a related device. The training method includes: constructing a document index for each training document; extracting from each document a plurality of sentence groups, each comprising at least one sentence; determining a target sentence group from the plurality of sentence groups, the target group being the one whose sentences have the highest mutual semantic relevance; vectorizing the sentences in the target group as keyword vectors of the document's search keywords; and constructing training samples from the keyword vectors and the document index to train a document search model. Because the training samples pair the document index with the vectors of multiple semantically related sentences, the trained model learns to map a search keyword, as well as words and sentences sharing its semantics, to the matching document index. The method is therefore suitable for search scenarios with complex semantics and can improve the recall rate of complex semantic search.

Description

Document search model training method, document search method and related device
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a document search model training method, a document search method, and related devices.
Background
When a government-affairs system uses a robot to handle users' service consultations, it usually replies by retrieving documents related to the user's query terms.
In the prior art, documents are mostly searched with an Elasticsearch search engine, which relies mainly on an inverted index. Specifically, a document ID is assigned to each document, and the document content is represented as a set of keywords: the document is segmented into words, several keywords are extracted, and the number and positions of each keyword's occurrences in the document are recorded. The inverted index then maps each keyword to the IDs of the documents that contain it, so that one keyword maps to multiple document IDs.
Because keywords are mapped directly to document IDs, a search can only retrieve documents that literally contain the query keywords. This matches only search scenarios with simple semantics expressed as keyword strings; documents that share the same semantics as the search keywords but not the exact words cannot be found. The approach is therefore difficult to apply to search scenarios with complex semantics and yields a low recall rate in complex semantic search.
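The keyword-to-document-ID mapping described above can be sketched as follows (a minimal illustration only; a real engine such as Elasticsearch adds analyzers, scoring, and much more). The query term only matches documents that literally contain it, which is exactly the limitation the background points out:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the documents (and token positions) where it occurs.

    `docs` is a dict of doc_id -> list of tokens (already word-segmented).
    """
    index = defaultdict(dict)  # keyword -> {doc_id: [positions]}
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {
    1: ["apply", "for", "resident", "permit"],
    2: ["permit", "renewal", "process"],
}
index = build_inverted_index(docs)
# Only literal matches are found; a semantically equivalent query like
# "license" would return nothing, illustrating the low-recall problem.
print(sorted(index["permit"]))  # doc IDs of documents containing "permit"
```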
Disclosure of Invention
The invention provides a document search model training method, a document search method, and a related device, to solve the problems that existing document search cannot retrieve documents sharing the same semantics as a search keyword, is difficult to apply to complex semantic search scenarios, and consequently has a low recall rate in complex semantic search.
In a first aspect, the present invention provides a document search model training method, including:
constructing a document index of a training document;
extracting a plurality of sentence groups from each document, wherein each sentence group comprises at least one sentence;
determining a target sentence group from the multiple sentence groups, and vectorizing sentences in the target sentence group to serve as keyword vectors of search keywords of the document, wherein the target sentence group is a sentence group with highest semantic relevance of each sentence;
constructing a training sample by adopting the keyword vector of the document and the document index;
and training a document search model by adopting the training sample, wherein the document search model outputs a document index when a search keyword is input.
In a second aspect, the present invention provides a document searching method, including:
receiving search information input on a user terminal;
inputting the search information into a document search model to obtain a document index;
searching a target document matched with the document index in a preset document database;
transmitting the target document to the user terminal;
wherein the document search model is trained by the document search model training method of the first aspect.
In a third aspect, the present invention provides a document search model training apparatus, including:
the document index construction module is used for constructing a document index of the training document;
the document sentence group extraction module is used for extracting a plurality of sentence groups from each document, wherein each sentence group comprises at least one sentence;
the keyword vector acquisition module is used for determining a target sentence group from the plurality of sentence groups, and vectorizing sentences in the target sentence group as keyword vectors of the document's search keywords, wherein the target sentence group is the group whose sentences have the highest semantic relevance;
the training sample construction module is used for constructing a training sample by adopting the keyword vector of the document and the document index;
the training module is used for training the document search model with the training samples, wherein the document search model outputs a document index when a search keyword is input.
In a fourth aspect, the present invention provides a document searching apparatus comprising:
the search information receiving module is used for receiving search information input on the user terminal;
the document index prediction module is used for inputting the search information into a document search model to obtain a document index;
the target document searching module is used for searching a target document matched with the document index in a preset document database;
a target document sending module, configured to send the target document to the user terminal;
wherein the document search model is trained by the document search model training method of the first aspect.
In a fifth aspect, the present invention provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the document search model training method of the first aspect of the invention and/or the document search method of the second aspect.
In a sixth aspect, the present invention provides a computer readable storage medium storing computer instructions for causing a processor to implement the document search model training method according to the first aspect of the present invention and/or the document search method according to the second aspect of the present invention when executed.
According to the embodiments of the invention, a plurality of sentence groups are extracted from each document, each group comprising at least one sentence; a target sentence group, namely the group whose sentences have the highest mutual semantic relevance, is determined from the plurality of groups; the sentences in the target group are vectorized as keyword vectors of the document's search keywords; training samples are constructed from the keyword vectors and the document index; and the document search model is trained with these samples so that it outputs a document index when a search keyword is input. Because the vectors of multiple semantically related sentences serve as the keyword vectors, and the training samples pair them with the document index, the model learns to output the matching document index for a user's search keyword as well as for words and sentences with the same semantics. The method therefore suits complex semantic search scenarios, improves the recall rate of complex semantic search, and allows the target document to be retrieved accurately through the document index.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; other drawings may be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flowchart of a document search model training method according to an embodiment of the present invention;
FIG. 2A is a flowchart of a document search model training method according to a second embodiment of the present invention;
FIG. 2B is a schematic diagram of various matrices in an embodiment of the invention;
FIG. 2C is a schematic diagram of a directed acyclic graph in an embodiment of the invention;
FIG. 2D is a schematic diagram of a document collection hierarchy in an embodiment of the present invention;
FIG. 3 is a flowchart of a document searching method according to a third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a training device for document search model according to a fourth embodiment of the present invention;
FIG. 5 is a schematic diagram of a document searching apparatus according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort shall fall within the scope of the present invention.
Example 1
Fig. 1 is a flowchart of a document search model training method according to Embodiment 1 of the present invention. The method is applicable to training a document search model that predicts a document index through which a target document is then searched. It may be performed by a document search model training apparatus, which may be implemented in hardware and/or software and configured in an electronic device. As shown in Fig. 1, the document search model training method includes:
S101, constructing a document index of a training document.
In this embodiment, a training document is a document used for training the document search model. In one example, the documents may be various government-affairs documents, such as regulations, government service handling flow files, and question-answer files of a government question-answering system; different documents may be obtained as training documents in different application scenarios.
The document index may be any information that indicates how to locate the document. For example, it may be the document's storage path, its structural path within structured data, or its classification number and position within the classification. This embodiment limits neither the form of the document index nor the manner of constructing it.
S102, for each document, extracting a plurality of sentence groups from the document, wherein each sentence group comprises at least one sentence.
Specifically, sentences with no semantic relevance to the content of the whole document may be filtered out, leaving m sentences; n sentences are then randomly drawn from the m sentences, yielding multiple differently combined groups of n sentences, where each group comprises at least one sentence, i.e., n is greater than or equal to 1.
Of course, in another embodiment, m sentences may be randomly extracted from the document, and n sentences then randomly extracted from those m sentences, yielding multiple groups of n sentences in various combinations.
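The group-extraction step above can be sketched as follows (a hypothetical helper; the patent also allows filtering semantically irrelevant sentences rather than sampling at random, and the sentence strings here are placeholders):

```python
import itertools
import random

def extract_sentence_groups(sentences, m, n):
    """Keep m candidate sentences, then enumerate every combination of
    n of them, giving C(m, n) sentence groups of n sentences each."""
    candidates = random.sample(sentences, m)
    return [list(g) for g in itertools.combinations(candidates, n)]

random.seed(0)  # for reproducibility of the random sampling
groups = extract_sentence_groups(["s1", "s2", "s3", "s4", "s5"], m=4, n=2)
print(len(groups))  # C(4, 2) = 6 groups, each with 2 sentences
```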
S103, determining a target sentence group from the plurality of sentence groups, and vectorizing the sentences in the target sentence group as keyword vectors of the document's search keywords, wherein the target sentence group is the group whose sentences have the highest semantic relevance.
In one embodiment, for each sentence group, word segmentation is performed on each sentence to obtain its words. For each target word of a sentence, the similarity between the target word and the words of the remaining sentences is calculated, and the number of words whose similarity with the target word exceeds a threshold is counted; the ratio of this number to the total number of words in the group is taken as the score of the target word. The sum of the scores of all words in the group is the score of the sentence group. A higher group score indicates higher similarity between the words of the different sentences in the group, that is, more words sharing the same semantics.
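The scoring just described can be sketched as follows. The character-overlap similarity is only a toy stand-in for a real word-vector similarity, and the example words are hypothetical; what matters is the counting rule: per-word hit count over total words, summed across the group:

```python
def word_similarity(w1, w2):
    # Toy stand-in (assumption): Jaccard overlap of characters.
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)

def group_score(sentences, threshold=0.5):
    """Score a sentence group: for each word, count words in the *other*
    sentences whose similarity exceeds the threshold, divide by the total
    word count of the group, and sum these per-word scores."""
    total = sum(len(s) for s in sentences)
    score = 0.0
    for i, sent in enumerate(sentences):
        others = [w for j, s in enumerate(sentences) if j != i for w in s]
        for target in sent:
            hits = sum(1 for w in others if word_similarity(target, w) > threshold)
            score += hits / total
    return score

g1 = [["tax", "rate"], ["tax", "rates"]]  # semantically close sentences
g2 = [["tax", "rate"], ["dog", "walks"]]  # unrelated sentences
print(group_score(g1) > group_score(g2))  # True: g1 would be the target group
```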
S104, constructing a training sample by using the keyword vector of the document and the document index.
In one embodiment, positive samples may be constructed from multiple keyword vectors and the document index of the same document, and negative samples from keyword vectors and document indexes of different documents, yielding positive and negative samples for training.
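The positive/negative pairing can be sketched as follows (the vectors and index strings are placeholder values; a same-document pair is labeled 1, a cross-document pair 0):

```python
def build_samples(doc_keyword_vectors, doc_indexes):
    """Pair keyword vectors with document indexes: same-document pairs
    become positive samples, cross-document pairs negative samples."""
    positives, negatives = [], []
    for doc_id, vectors in doc_keyword_vectors.items():
        for vec in vectors:
            positives.append((vec, doc_indexes[doc_id], 1))
        for other_id, other_vecs in doc_keyword_vectors.items():
            if other_id != doc_id:
                for vec in other_vecs:
                    # keyword vector of another document vs. this index
                    negatives.append((vec, doc_indexes[doc_id], 0))
    return positives, negatives

kw = {"d1": [[0.1, 0.2]], "d2": [[0.9, 0.8]]}
idx = {"d1": "010101", "d2": "010102"}
pos, neg = build_samples(kw, idx)
print(len(pos), len(neg))  # 2 positive and 2 negative samples
```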
S105, training a document search model by adopting a training sample, wherein the document search model outputs a document index when a search keyword is input.
The document search model may be any of various neural networks, for example one comprising an encoder-decoder structure. After the network structure and model parameters are initialized, keyword vectors may be randomly drawn from the training samples and input into the model to obtain a predicted document index; a loss value is then computed from the document index in the sample and the predicted index using a preset loss function, and the model parameters are adjusted according to the loss until the loss falls below a loss threshold or the number of training iterations reaches a preset count, yielding a trained document search model. After training, the model outputs a document index for a user's search input, and the target document is searched through that index and returned to the user.
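The loop just described (sample, predict, compute loss, adjust parameters, stop on a loss threshold or iteration cap) can be sketched with a deliberately tiny one-parameter "model"; the real model would be an encoder-decoder network, so everything below the loop structure is a stand-in assumption:

```python
import random

def train_search_model(samples, lr=0.1, loss_threshold=1e-4, max_iters=5000):
    """Minimal training-loop sketch: fit w so that w * x predicts y,
    where (x, y) stands in for (keyword vector, document index)."""
    random.seed(1)
    w = 0.0  # initialized model parameter
    for _ in range(max_iters):
        x, y = random.choice(samples)   # randomly drawn training sample
        pred = w * x                    # predicted "document index"
        loss = (pred - y) ** 2          # preset loss function
        if loss < loss_threshold:       # stop when loss is small enough
            break
        w -= lr * 2 * (pred - y) * x    # adjust parameters via the loss
    return w

w = train_search_model([(1.0, 2.0), (2.0, 4.0)])
print(round(w, 2))  # converges near 2.0, the mapping both samples share
```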
According to this embodiment of the invention, a plurality of sentence groups, each comprising at least one sentence, are extracted from the document; a target sentence group, namely the group whose sentences have the highest semantic relevance, is determined; the sentences in the target group are vectorized as keyword vectors of the document's search keywords; training samples are constructed from the keyword vectors and the document index; and the document search model is trained with the samples so that it outputs a document index when a search keyword is input. By using the vectors of multiple semantically related sentences as keyword vectors and pairing them with the document index in the training samples, the model learns to output the matching document index for a user's search keyword as well as for words and sentences with the same semantics, which suits complex semantic search scenarios, improves the recall rate of complex semantic search, and allows the target document to be retrieved accurately through the document index.
Example two
Fig. 2A is a flowchart of a document search model training method according to Embodiment 2 of the present invention, which optimizes the method of Embodiment 1. As shown in Fig. 2A, the method includes:
S201, vectorizing the training documents to obtain a document vector of each document.
In one embodiment, documents applied to a certain business may be determined as training documents; illustratively, documents for various services in a government-affairs system may be used. Each document is vectorized to obtain its document vector; in one example, the document may be input into a language processing model such as BERT to obtain the document vector.
S202, constructing an aggregation degree matrix based on the document vector, wherein each element in the aggregation degree matrix is the similarity between two documents.
In one embodiment, the similarity between the document vectors of every two documents may be calculated, an association matrix constructed from these similarities, and a degree matrix further constructed from the association matrix. The aggregation degree of two documents is then calculated from the similarity of their document vectors and their degrees in the degree matrix, generating the aggregation degree matrix.
Specifically, the cosine similarity, Euclidean similarity, Pearson correlation coefficient, or the like of the document vectors of two documents may be calculated. When the similarity is smaller than a preset threshold, the two documents are determined to be unrelated and their association is marked as 0; when the similarity is greater than or equal to the threshold, they are determined to be related and their association is marked as 1. This generates the association matrix R shown in Fig. 2B, in which an element of 0 indicates that the corresponding two documents are unrelated and an element of 1 indicates that they are related.
After the association matrix R is generated, an undirected graph may be built from it, as shown in Fig. 2C: each node is a document, and an edge between nodes indicates an association between the corresponding documents. The number of edges of each document is then counted as that document's degree, and the degree matrix is generated from these degrees. As shown in Fig. 2B, the diagonal of the degree matrix D holds the degree of each document, which represents the number of documents associated with it, that is, the number of edges of its node in the graph.
The aggregation degree matrix is formed from the aggregation degrees of document pairs, each computed from the similarity of the two document vectors and the documents' degrees in the degree matrix. In one example, the aggregation degree of two documents is the product of their similarity and their degrees; in another example, the degrees of the two documents are normalized and summed, the sum is used as a weight, and the product of the similarity and this weight is taken as the aggregation degree. This yields the aggregation degree matrix U of Fig. 2B.
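The construction of R, the degrees (diagonal of D), and U can be sketched as follows. The combination used for U, similarity times the sum of normalized degrees, is the second of the two variants mentioned; the threshold and toy vectors are assumptions for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def aggregation_matrices(doc_vectors, threshold=0.5):
    """Build the association matrix R, per-document degrees, and the
    aggregation degree matrix U from document vectors."""
    n = len(doc_vectors)
    S = [[cosine(doc_vectors[i], doc_vectors[j]) for j in range(n)] for i in range(n)]
    R = [[1 if i != j and S[i][j] >= threshold else 0 for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in R]           # degree = number of associated documents
    total = sum(deg) or 1                   # guard against a graph with no edges
    w = [d / total for d in deg]            # normalized degrees used as weights
    U = [[S[i][j] * (w[i] + w[j]) for j in range(n)] for i in range(n)]
    return R, deg, U

vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]  # documents 0 and 1 are similar
R, deg, U = aggregation_matrices(vecs)
print(R[0][1], R[0][2], deg)  # 0-1 associated, 0-2 not; document 2 isolated
```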
In this way, the association matrix R is constructed from the similarity of document vectors, the degree matrix is constructed from R, and the aggregation degree of each document pair is computed from the degree matrix and the pair's similarity. The aggregation degree thus represents how likely two documents are to be placed in the same document set when the documents are partitioned into multiple sets, providing a basis for the partition in which the documents of one set have high semantic relevance.
S203, dividing the plurality of documents for training into a preset number of document sets according to the aggregation degree matrix.
In one embodiment, an objective function may be constructed to compute the aggregation loss incurred when the documents are partitioned into the preset number of document sets, the aggregation loss representing the loss caused by placing associated documents into different sets. The partition scheme that minimizes the objective function is solved to obtain the preset number of document sets. The number of documents in each set is then counted; if a target set contains more documents than a document-count threshold, its documents are further partitioned according to the aggregation degree matrix into sub-document sets, each containing fewer documents than the threshold.
Illustratively, assume the training documents are x_i, i = 1, ..., N, to be partitioned into m document sets C = (C1, C2, C3, ..., Cm). Suppose the documents are x1, x2, x3, x4, x5 and m = 2, i.e., two document sets A = {x1, x3} and B = {x2, x4, x5}. The objective function is as follows:
U(A, B) = Σ_{i∈A, j∈B} u_ij    (1)
In formula (1), u_ij is the aggregation degree between documents x_i and x_j. Taking A = {x1, x3} and B = {x2, x4, x5} as an example: since the five documents are split into the two sets A and B, the aggregation degrees between the cross-set pairs x1-x2, x1-x4, x1-x5, x3-x2, x3-x4, and x3-x5 are lost, and formula (1) computes the sum of these aggregation degrees u_ij, that is, the total aggregation loss.
The aggregation loss above is described for documents x1 to x5 divided into two document sets A and B; when more documents are partitioned into m document sets, the objective function of the aggregation loss is as follows:
in the formula (2), U (x) i C) dividing the document x into a plurality of document sets i Aggregation loss with documents in each document collection, U (x i ,x i ) For document x in document collection i The loss of degree of polymerization calculated repeatedly at the time,the sum of the aggregation degrees can normalize the aggregation degree loss by continuously optimizing the documents in each document set in C so that the final document set c= (C1, C2, C3, C4, C5, … …, cm) is obtained when the J (C) function value is minimized.
When a document set obtained by the partition contains more than a preset number of documents, for example more than 10, that set is partitioned further into sub-document sets until every sub-set contains no more than the preset number. For example, if document set C5 contains more than 10 documents, its documents are partitioned again by the method above, yielding sub-document sets C51, C52, and so on, of document set C5.
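The cross-set loss of formula (1) can be computed directly; the aggregation matrix below uses hypothetical values chosen so that x1/x3 and x2/x4/x5 form tightly linked clusters (0-based indices in code):

```python
def partition_loss(U, A, B):
    """Aggregation loss of formula (1): the sum of u_ij over all
    cross pairs with i in A and j in B."""
    return sum(U[i][j] for i in A for j in B)

# Toy symmetric aggregation degree matrix for documents x1..x5 (assumed values)
U = [
    [0.0, 0.1, 0.9, 0.2, 0.1],
    [0.1, 0.0, 0.1, 0.8, 0.7],
    [0.9, 0.1, 0.0, 0.1, 0.2],
    [0.2, 0.8, 0.1, 0.0, 0.6],
    [0.1, 0.7, 0.2, 0.6, 0.0],
]
A, B = [0, 2], [1, 3, 4]        # A = {x1, x3}, B = {x2, x4, x5}
print(partition_loss(U, A, B))  # small cross-set loss: a good partition
```

Searching over partitions for the one minimizing this loss (normalized as in formula (2)) yields the document sets.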
Fig. 2D shows a schematic diagram of the document set hierarchy: the first hierarchy L01 contains the m document sets C1, C2, C3, C4, C5, ..., Cm, and the second hierarchy L02 contains the sub-document sets C51 and C52 of document set C5.
S204, determining a document index of the document according to the position of the document in the document set.
In one embodiment, the hierarchy of the document set in which the document is located, the position of that set within the hierarchy, and the position of the document within the set may be determined, and the hierarchy, set position, and document position encoded to obtain the document index.
As shown in Fig. 2D, the document sets span a first hierarchy L01 and a second hierarchy L02, and each set occupies a position within its hierarchy. The hierarchy, the set position, and the document position may be encoded to obtain the document index. In one example, the encoding rule may be: hierarchy + set position + document position. For document x2, the index may be 010102, meaning the 2nd document of the 1st document set C1 in hierarchy 01; for document x10, the index may be 020207, meaning the 7th document of the 2nd document set C52 in hierarchy 02. Of course, the encoding rule may also follow the encoding of root and child nodes in a tree structure; the rule is not limited here.
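The example encoding rule (two digits each for hierarchy, set position, and document position, an assumption consistent with the 010102/020207 examples) can be sketched as:

```python
def encode_document_index(level, set_pos, doc_pos):
    """Encode hierarchy level, document-set position within the level,
    and document position within the set, two digits each."""
    return f"{level:02d}{set_pos:02d}{doc_pos:02d}"

print(encode_document_index(1, 1, 2))  # index of document x2 in set C1
print(encode_document_index(2, 2, 7))  # index of document x10 in set C52
```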
According to the above, the documents are divided into a plurality of document sets according to the aggregation degree matrix, so that the documents in each document set are highly semantically related, and the unstructured documents are given structure through the hierarchy of the document set, the position of the document set within the hierarchy, and the position of the document within the document set, from which the document index is obtained, thereby improving document searching efficiency.
S205, for each document, extracting a plurality of sentence groups from the document, each sentence group comprising at least one sentence.
Specifically, for each document, sentences that are not semantically related to the content of the document as a whole may be filtered out, leaving m sentences; n sentences are then randomly extracted from the m sentences to obtain a plurality of sentence groups, each sentence group comprising n sentences.
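A minimal sketch of this extraction step, assuming the off-topic sentences have already been filtered out (the group count and the seed are arbitrary choices of the sketch):

```python
import random

def extract_sentence_groups(sentences, n, num_groups, seed=0):
    """Randomly draw `num_groups` groups of n sentences each from the
    filtered sentence list, per step S205."""
    rng = random.Random(seed)
    return [rng.sample(sentences, n) for _ in range(num_groups)]
```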
S206, aiming at each sentence group, segmenting each sentence in the sentence group to obtain the segmentation of each sentence.
Each sentence in the sentence group is composed of a plurality of words and phrases, and the word segments of each sentence can be obtained by a word segmentation algorithm based on character string matching, understanding, trees, statistics, or the like.
In another embodiment, a word segmentation model may be pre-trained to input sentences into the word segmentation model to obtain a plurality of word segments of the sentences.
S207, calculating the matching degree of each target word of each sentence in the sentence group and the words of other sentences.
Specifically, for each sentence in the sentence group, each word of the sentence can be traversed, the traversed word is determined as a target word, and the similarity of the target word and the vectors of the words of other sentences is calculated as the matching degree.
S208, when the matching degree is greater than a preset matching degree threshold, incrementing the counter of the target word segment by 1.

When the matching degree between the target word segment and a word segment of another sentence is greater than a threshold (such as 0.9), the counter of the target word segment is incremented by 1; that is, the number of word segments in the other sentences that match the target word segment increases by 1.
S209, after the matching degree calculation of the target word and the word of other sentences is finished, calculating the ratio of the counting number of the counter to the total number of the word in the sentence group to be used as the score of the target word.
After the matching degree between the target word segment and the word segments of all the other sentences has been calculated and compared with the matching degree threshold, the count can be read from the counter; this count is the number of word segments among all sentences of the sentence group whose semantic similarity to the target word segment exceeds the threshold. The ratio of this count to the total number of word segments of all sentences in the sentence group is then calculated as the score of the target word segment.
For example, assume the sentence group includes sentence 1, sentence 2 and sentence 3, where sentence 1 includes word segments 11, 12 and 13, sentence 2 includes word segments 21 and 22, and sentence 3 includes word segments 31, 32, 33 and 34. For sentence 1, the vector similarity between word segment 11 and each of word segments 21, 22, 31, 32, 33 and 34 is calculated, and each time a similarity is greater than the threshold, the counter is incremented by 1. If the final count for word segment 11 is k, the ratio of k to p can be calculated as the score of word segment 11, where p is the total number of word segments of all sentences in the sentence group, i.e., p = 9. The score of every other word segment is calculated in the same way.
S210, calculating the sum of scores of all the segmented words in the sentence group to serve as the score of the sentence group.
After the score of each word in each sentence in the sentence group is calculated, the sum of the scores of all the words in the sentence group is calculated to be used as the score of the sentence group.
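Steps S207-S210 can be sketched as below; the `similarity` argument stands in for the word-vector similarity of the text, and the exact-match function used in the test is only a toy assumption:

```python
def word_scores(group, similarity, threshold=0.9):
    """Score each word segment as (number of matching word segments in the
    OTHER sentences) / (total word segments in the group), per S207-S209.
    `similarity` is any pairwise similarity function over word segments."""
    total = sum(len(sentence) for sentence in group)   # p: total segments
    scores = []
    for i, sentence in enumerate(group):
        others = [w for j, s in enumerate(group) if j != i for w in s]
        for word in sentence:
            count = sum(1 for w in others if similarity(word, w) > threshold)
            scores.append((word, count / total))
    return scores

def group_score(group, similarity, threshold=0.9):
    """Score of the whole sentence group: the sum of its word scores (S210)."""
    return sum(score for _, score in word_scores(group, similarity, threshold))
```

With an exact-match similarity and the group `[["w11", "w12"], ["w11", "w22"]]`, each occurrence of "w11" matches one of the four word segments, so the group score is 0.5.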
S211, determining the sentence group with the highest score as a target sentence group.
Specifically, after the score of each sentence group has been calculated, the sentence group ranked first when the scores are sorted from high to low may be determined as the target sentence group.
S212, inputting sentences in the target sentence group into a pre-trained keyword vector generation model for a plurality of times to obtain a plurality of keyword vectors.
In this embodiment, when a sentence is input into the keyword vector generation model, the model randomly masks the word segments of the input sentence according to a preset proportion and generates a vector of the partially masked sentence as a keyword vector. Because the masking is random, inputting the sentences of the target sentence group into the keyword vector generation model multiple times yields multiple different keyword vectors that nevertheless remain semantically related to one another.
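The random masking behaviour described here can be sketched as follows; the `[MASK]` token and the default mask ratio are assumptions of the sketch, not values given in the text:

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=None):
    """Randomly replace a preset proportion of word segments with [MASK];
    repeated calls on the same sentence therefore yield different masked
    variants, which is why the model produces multiple keyword vectors."""
    rng = random.Random(seed)
    k = max(1, int(len(tokens) * mask_ratio))
    masked = set(rng.sample(range(len(tokens)), k))
    return ["[MASK]" if i in masked else t for i, t in enumerate(tokens)]
```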
S213, constructing a training sample by using the keyword vector of the document and the document index.
In this embodiment, the document index docidx of each document is constructed through S201-S204, and the plurality of keyword vectors queryvec of each document are constructed through S205-S212. A positive sample may be constructed from a keyword vector and the document index of the same document, and a negative sample from a keyword vector and the document index of a different document. For example, if the document index of document A is docidx_a and its keyword vector is queryvec_a, and the document index of document B is docidx_b and its keyword vector is queryvec_b, then positive samples (queryvec_a, docidx_a) and (queryvec_b, docidx_b) may be constructed, and negative samples (queryvec_a, docidx_b) and (queryvec_b, docidx_a) may be constructed.
S214, training a document search model by adopting a training sample, wherein the document search model outputs a document index when a search keyword is input.
In one embodiment, a document search model may be initialized; a sample is randomly drawn and input into the document search model to obtain a predicted document index; a loss value is calculated from the predicted document index and the document index in the training sample; and whether the loss value is smaller than a preset loss threshold is judged. If so, training stops and the trained document search model is obtained; if not, the model parameters of the document search model are adjusted according to the loss value, and the procedure returns to the step of randomly drawing a sample and inputting it into the document search model to obtain a predicted document index.
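The loop just described can be sketched as follows; `model_step` is a hypothetical callable that runs a forward pass on the sample, applies any parameter update internally, and returns the loss:

```python
import random

def train(model_step, samples, loss_threshold=0.01, max_iters=1000, seed=0):
    """Skeleton of S214: draw a random sample, compute the loss, stop when
    the loss drops below the threshold, otherwise update and repeat."""
    rng = random.Random(seed)
    for _ in range(max_iters):
        sample = rng.choice(samples)
        loss = model_step(sample)
        if loss < loss_threshold:
            return True   # converged: trained model obtained
    return False          # hit the iteration cap without converging
```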
The document search model of the present embodiment includes an encoding network and a decoding network. The input of the encoding network is the keyword vector queryvec of a training sample; the encoding network outputs a feature vector X, which is input into the decoding network to predict the document index docidx. The decoding network can be expressed by the following formula:

$$P(\mathrm{docidx} \mid X, W) = \prod_{i=1}^{N} P(\mathrm{idx}_i \mid X, \mathrm{idx}_{1,\dots,i-1}, W_i)$$

In the above formula, X represents the output of the encoder, W is the model parameter, W_i represents the model parameter of the i-th step, idx_i is the i-th element of the document index sequence docidx, and N is the length of the document index sequence.
Wherein the loss value can be calculated by the following loss function:

$$\mathcal{L} = -\sum_{(E(q),\,\mathrm{docidx}) \in D} \log P(\mathrm{docidx} \mid E(q), W) + \lambda\, \Omega(P, Q) \quad (4)$$

wherein

$$P(\mathrm{docidx} \mid E(q), W) = \prod_{i=1}^{N} P(\mathrm{idx}_i \mid E(q), \mathrm{idx}_{1,\dots,i-1}, W_i) \quad (5)$$

In the above formulas (4) and (5), q is a keyword, E(q) is the keyword vector, D is the set of pairs of keyword vectors E(q) and document indexes docidx, and P(docidx | E(q), W) is the probability of generating docidx when the keyword vector E(q) is the input. P(idx_i | E(q), idx_{1,…,i-1}, W_i) is the probability of obtaining the index element idx_i the first time the keyword vector E(q) is input, and Q(idx_i | E(q), idx_{1,…,i-1}, W_i) is the probability of obtaining idx_i the second time the same keyword vector E(q) is input. The first term of formula (4) is the cross-entropy loss for the document index docidx, and the second term Ω(P, Q) is a regularization term enforcing consistency among the different document indexes generated for the same keyword vector. That is, the document search model outputs each document index with its probability, and the loss value is then calculated by the above formula.
After the loss value is calculated by the loss function described above, the model parameters may be adjusted by the loss value, and in one example, a gradient may be calculated from the loss value, and a gradient descent may be performed on the model parameters to adjust the model parameters.
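As a minimal illustration of this adjustment, a plain gradient-descent step moves each parameter against its gradient, scaled by the learning rate (the value 0.1 is an arbitrary assumption):

```python
def gradient_step(params, grads, lr=0.1):
    """One plain gradient-descent update: each parameter moves against its
    gradient by the learning rate lr."""
    return [p - lr * g for p, g in zip(params, grads)]

# e.g. gradient_step([1.0, 2.0], [10.0, -10.0]) -> [0.0, 3.0]
```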
In this embodiment, an aggregation degree matrix is constructed based on the document vectors of the training documents, each element of which is the similarity between two documents; the plurality of training documents are divided into a preset number of document sets according to the aggregation degree matrix; the document index of each document is determined according to the position of the document in its document set; a plurality of sentence groups are extracted from each document and a target sentence group is determined, the target sentence group being the sentence group whose sentences have the highest semantic relevance; the sentences in the target sentence group are input into the pre-trained keyword vector generation model multiple times to obtain a plurality of keyword vectors; and training samples are constructed from the keyword vectors and document indexes of the documents to train the document search model. On the one hand, by using the vectors of the semantically related sentences in the target sentence group as the keyword vectors of the search keywords of a document and constructing training samples from keyword vectors and document indexes, the document search model learns to match a document index to the vectors of words and sentences with the same semantics as the search keywords input by a user; the model is therefore suitable for search scenarios with complex semantics, can improve the recall rate of complex semantic search, and can recall an accurate document index so that the target document is found through the document index. On the other hand, the documents are divided into a plurality of document sets according to the aggregation degree matrix, so that the documents in each document set are highly semantically related, and a path for each document is formed by the hierarchy of its document set, the position of the document set within the hierarchy, and the position of the document within the document set, from which the document index is generated; the unstructured documents are thereby structured into document indexes, improving document searching efficiency.
Further, each sentence in a sentence group is segmented into words; the matching degree between each target word segment of each sentence and the word segments of the other sentences is calculated; the score of a target word segment is obtained as the ratio of the number of word segments whose matching degree exceeds the threshold to the total number of word segments; the sum of the scores of all word segments in a sentence group is taken as the score of the sentence group; and the sentence group with the highest score is determined as the target sentence group. The vectors of the target sentence group serve as the multiple keyword vectors of the document; these keyword vectors differ but are semantically related, and training the document search model with these keyword vectors and the document index enables the model to match a document index to the vectors of words and sentences with the same semantics as the search keyword.
Example III
Fig. 3 is a flowchart of a document searching method according to a third embodiment of the present invention, where the method may be performed by a document searching apparatus, which may be implemented in hardware and/or software, and the document searching apparatus may be configured in an electronic device, where the method is applicable to a case of searching a target document through search information input by a user. As shown in fig. 3, the document searching method includes:
S301, receiving search information input on a user terminal.
In one embodiment, the user terminal may be a self-service terminal in a government system, for example, may be a government service consultation terminal, and the user may input service information to be consulted, that is, search information, on the self-service terminal.
S302, inputting the search information into a document search model to obtain a document index.
In this embodiment, the document search model may be a model trained by the document search model training method provided in the first or second embodiment, which outputs one or more document indexes after inputting search information.
S303, searching a target document matched with the document index in a preset document database.
In this embodiment, a document database may be constructed in advance, the document database including the document indexes of the documents. Specifically, when constructing the document database, a plurality of documents may be vectorized to obtain the document vector of each document; an aggregation degree matrix is constructed based on the document vectors, each element of which is the similarity between two documents; the plurality of documents are divided into a preset number of document sets according to the aggregation degree matrix; and the document index of each document is determined according to the position of the document in its document set. The construction of the document indexes may refer to S201-S204 in the second embodiment, and is not repeated here.
S304, the target document is sent to the user terminal.
After the document index is predicted by the document search model, the target document may be looked up in the database through the document index and sent to the user terminal for display. Illustratively, the text of the target document may be shown on the display screen of the user terminal; in another example, the target document may be converted into voice data and the voice data played through the user terminal.
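The S301-S304 flow can be sketched end to end; `model` stands in for the trained document search model and `document_db` for the preset document database, both assumptions of this sketch:

```python
def handle_search(search_info, model, document_db):
    """Sketch of S301-S304: map the user's search text to a document index
    via the model, then look the index up in the document database."""
    docidx = model(search_info)
    return document_db.get(docidx, None)   # None when no document matches
```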
According to the document searching method of this embodiment, the search information is input into the pre-trained document search model to obtain a document index, and the target document is then found through that document index. Because the document search model was trained with samples constructed from the document index and from keyword vectors taken as the vectors of the semantically related sentences in the document's target sentence group, the model has learned to match a document index to the search keywords input by the user and to the vectors of words and sentences with the same semantics as those keywords.
Example IV
Fig. 4 is a schematic structural diagram of a training device for document search model according to a fourth embodiment of the present invention. As shown in fig. 4, the document search model training apparatus includes:
a document index construction module 401 for constructing a document index of a training document;
a document sentence group extraction module 402, configured to extract, for each document, a plurality of sentence groups from the document, each sentence group including at least one sentence;
a keyword vector obtaining module 403, configured to determine a target sentence group from the plurality of sentence groups and vectorize the sentences in the target sentence group as the keyword vectors of the search keywords of the document, where the target sentence group is the sentence group whose sentences have the highest semantic relevance;
a training sample construction module 404, configured to construct a training sample using the keyword vector of the document and the document index;
the training module 405 is configured to train a document search model using training samples, where the document search model outputs a document index when a search keyword is input.
Optionally, the document index building module 401 includes:
the document vectorization unit is used for vectorizing the training documents to obtain document vectors of each document;
The aggregation degree matrix construction unit is used for constructing an aggregation degree matrix based on the document vector, wherein each element in the aggregation degree matrix is the similarity between two documents;
the document dividing unit is used for dividing a plurality of documents for training into a preset number of document sets according to the aggregation degree matrix;
a document index generating unit for determining a document index of the document according to the position of the document in the document collection.
Optionally, the polymerization degree matrix construction unit includes:
a similarity calculation subunit for calculating the similarity of the document vectors of the two documents;
an association matrix construction subunit, configured to construct an association matrix according to the similarity, where each element in the association matrix represents whether two documents are associated: when the similarity of the document vectors of two documents is smaller than a preset threshold, the element value is 0 and the two documents are not associated; when the similarity is greater than or equal to the preset threshold, the element value is 1 and the two documents are associated;
a degree matrix construction subunit, configured to construct a degree matrix according to the association degree matrix, where each element on a diagonal line in the degree matrix represents the number of documents associated with each document;
And the aggregation degree matrix generation subunit is used for calculating the aggregation degree of the two documents according to the similarity of the document vectors of the two documents and the degree of the two documents in the degree matrix so as to generate an aggregation degree matrix.
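The association-matrix and degree-matrix subunits described above can be sketched as follows (the threshold value is an arbitrary assumption, and self-association is excluded from the degree count in this sketch):

```python
def association_matrix(sim, threshold=0.5):
    """Build the 0/1 association matrix from a similarity matrix: 1 when the
    similarity of two document vectors reaches the threshold, else 0."""
    n = len(sim)
    return [[1 if sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]

def degree_diagonal(assoc):
    """Diagonal of the degree matrix: how many other documents each document
    is associated with."""
    return [sum(row) - row[i] for i, row in enumerate(assoc)]
```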
Optionally, the document dividing unit includes:
the objective function construction subunit is used for constructing an objective function, and the objective function is used for calculating the aggregation degree loss after a plurality of documents are divided into a preset number of document sets, wherein the aggregation degree loss is used for representing the aggregation degree lost after the associated documents are divided into different document sets;
and the solving subunit is used for solving the document set when the function value of the objective function is minimum, and obtaining a divided preset number of document sets.
Optionally, the document dividing unit further includes:
a document number counting subunit, configured to count the number of documents in each document set;
a document number judging subunit, configured to judge whether a target document set with a document number greater than a document number threshold exists;
the document dividing subunit is used for dividing the documents in the target document set according to the aggregation degree matrix to obtain a sub-document set of the target document set, wherein the number of the documents included in the sub-document set is smaller than the threshold value of the number of the documents.
Optionally, the document set has a plurality of levels, and the document index generating unit includes:
a document position determining subunit, configured to determine a hierarchy of a document set in which the document is located, a hierarchy position of the document set in the hierarchy, and a document position of the document in the document set;
and the document index coding subunit is used for coding the hierarchy, the hierarchy position and the document position to obtain the document index of the document.
Optionally, the keyword vector obtaining module 403 includes:
the word segmentation unit is used for segmenting each sentence in the sentence group aiming at each sentence group to obtain the word segmentation of each sentence;
the word segmentation matching degree calculation unit is used for calculating the matching degree of each target word segmentation of each sentence in the sentence group and the word segmentation of other sentences;
the counter counting unit is used for accumulating the counter of the target word segmentation by 1 when the matching degree is larger than a preset matching degree threshold value;
the word segmentation score calculation unit is used for calculating the ratio of the counting number of the counter to the total number of the segmented words in the sentence group after the matching degree calculation of the target segmented words and the segmented words of other sentences is finished, so as to be used as the score of the target segmented words;
a sentence group score calculating unit for calculating a sum of scores of all the divided words in the sentence group as a score of the sentence group;
And the target sentence group determining unit is used for determining the sentence group with the highest score as the target sentence group.
Optionally, the keyword vector obtaining module 403 includes:
and the sentence input unit is used for inputting sentences in the target sentence group into the pre-trained keyword vector generation model for a plurality of times to obtain a plurality of keyword vectors.
Optionally, the training sample construction module 404 includes:
a positive sample construction unit for constructing a positive sample by using the keyword vector and the document index of the same document;
and the negative sample construction unit is used for constructing a negative sample by adopting keyword vectors and document indexes of different documents.
Optionally, the training module 405 includes:
the model initializing unit is used for initializing a document searching model;
the prediction unit is used for randomly extracting samples and inputting the samples into the document search model to obtain a predicted document index;
a loss value calculation unit for calculating a loss value by using the predicted document index and the document index in the training sample;
the loss value judging unit is used for judging whether the loss value is smaller than a preset loss threshold value or not;
the training stopping unit is used for stopping training the document search model when the loss value is smaller than the loss threshold, so as to obtain a trained document search model;
And the model parameter adjusting unit is used for adjusting the model parameters of the document search model according to the loss value.
The document search model training device provided by the embodiment of the invention can execute the document search model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example five
Fig. 5 is a schematic structural diagram of a document searching apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the document searching apparatus includes:
a search information receiving module 501, configured to receive search information input on a user terminal;
a document index prediction module 502, configured to input search information into a document search model to obtain a document index;
a target document searching module 503, configured to search a preset document database for a target document matching the document index;
a target document transmission module 504 for transmitting the target document to the user terminal;
the document search model is trained by the document search model training method provided in the first embodiment or the second embodiment.
Optionally, the document database is built up by the modules:
the document vectorization module is used for vectorizing a plurality of documents to obtain a document vector of each document;
The aggregation degree matrix construction module is used for constructing an aggregation degree matrix based on the document vector, wherein each element in the aggregation degree matrix is the similarity between two documents;
the document dividing module is used for dividing the plurality of documents into a preset number of document sets according to the aggregation degree matrix;
and the document index generating module is used for determining the document index of the document according to the position of the document in the document set.
The document searching device provided by the embodiment of the invention can execute the document searching method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.
Example six
Fig. 6 shows a schematic diagram of an electronic device 40 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic equipment may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 40 includes at least one processor 41, and a memory communicatively connected to the at least one processor 41, such as a Read Only Memory (ROM) 42, a Random Access Memory (RAM) 43, etc., in which the memory stores a computer program executable by the at least one processor, and the processor 41 may perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 42 or the computer program loaded from the storage unit 48 into the Random Access Memory (RAM) 43. In the RAM 43, various programs and data required for the operation of the electronic device 40 may also be stored. The processor 41, the ROM 42 and the RAM 43 are connected to each other via a bus 44. An input/output (I/O) interface 45 is also connected to bus 44.
Various components in electronic device 40 are connected to I/O interface 45, including: an input unit 46 such as a keyboard, a mouse, etc.; an output unit 47 such as various types of displays, speakers, and the like; a storage unit 48 such as a magnetic disk, an optical disk, or the like; and a communication unit 49 such as a network card, modem, wireless communication transceiver, etc. The communication unit 49 allows the electronic device 40 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 41 may be various general and/or special purpose processing components with processing and computing capabilities. Some examples of the processor 41 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 41 performs the various methods and processes described above, such as the document search model training method and/or the document search method.
In some embodiments, the document search model training method, and/or the document search method, may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as the storage unit 48. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 40 via the ROM 42 and/or the communication unit 49. When the computer program is loaded into RAM 43 and executed by processor 41, one or more steps of the document search model training method described above, and/or the document search method, may be performed. Alternatively, in other embodiments, processor 41 may be configured to perform the document search model training method, and/or the document search method, in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, the one or more computer programs being executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak scalability found in traditional physical hosts and VPS services.
It should be appreciated that the various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (16)

1. A document search model training method, comprising:
constructing a document index of a training document;
extracting a plurality of sentence groups from each document, wherein each sentence group comprises at least one sentence;
determining a target sentence group from the multiple sentence groups, and vectorizing sentences in the target sentence group to serve as keyword vectors of search keywords of the document, wherein the target sentence group is the sentence group whose sentences have the highest semantic relevance to one another;
constructing a training sample by adopting the keyword vector of the document and the document index;
and training a document search model by adopting the training sample, wherein the document search model outputs a document index when a search keyword is input.
2. The document search model training method according to claim 1, wherein the constructing a document index of the training document comprises:
vectorizing the training documents to obtain a document vector of each document;
constructing an aggregation degree matrix based on the document vector, wherein each element in the aggregation degree matrix is the similarity between two documents;
dividing a plurality of documents for training into a preset number of document sets according to the aggregation degree matrix;
and determining a document index of the document based on the position of the document in the document set.
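As an illustrative sketch only: claim 2 does not prescribe a vectorization model, so the following uses a simple bag-of-words vectorization with cosine similarity as a stand-in to show how a pairwise document similarity matrix could be built. The documents and vocabulary here are hypothetical.

```python
import math
from collections import Counter

def vectorize(doc, vocab):
    # Bag-of-words document vector over a fixed vocabulary;
    # a stand-in for the unspecified vectorization model in the claim.
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

def cosine(u, v):
    # Cosine similarity between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["apple banana apple", "banana cherry", "apple banana"]
vocab = sorted({w for d in docs for w in d.split()})
vecs = [vectorize(d, vocab) for d in docs]
# Element (i, j) is the similarity between documents i and j.
sim = [[cosine(u, v) for v in vecs] for u in vecs]
```

In a real system the bag-of-words step would typically be replaced by a pretrained sentence-embedding model; the matrix construction is unchanged.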
3. The document search model training method according to claim 2, wherein the constructing an aggregation degree matrix based on the document vector includes:
calculating the similarity of the document vectors of the two documents;
constructing an association matrix according to the similarity, wherein each element in the association matrix indicates whether two documents are associated: the element value is 0 when the similarity of the document vectors of the two documents is smaller than a preset threshold (the two documents are not associated), and 1 when the similarity is greater than or equal to the preset threshold (the two documents are associated);
constructing a degree matrix according to the association matrix, wherein each element on the diagonal of the degree matrix represents the number of documents associated with each document;
and calculating the aggregation degree of the two documents according to the similarity of the document vectors of the two documents and the degrees of the two documents in the degree matrix to generate an aggregation degree matrix.
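A minimal sketch of claim 3's three matrices follows. The claim does not fix an exact normalization formula for the aggregation degree; dividing the similarity by the square root of the product of the two documents' degrees (as in a normalized graph adjacency) is one plausible reading, and the threshold and similarity values below are hypothetical.

```python
import math

def aggregation_matrix(sim, threshold=0.5):
    # Build the association, degree, and aggregation-degree matrices
    # of claim 3. The degree-based normalization is an assumption.
    n = len(sim)
    # Association matrix: 1 if similarity >= threshold, else 0.
    assoc = [[1 if sim[i][j] >= threshold else 0 for j in range(n)]
             for i in range(n)]
    # Degree of each document: number of *other* documents it is
    # associated with (the diagonal of the degree matrix).
    deg = [sum(assoc[i][j] for j in range(n) if j != i) for i in range(n)]
    # Aggregation degree: similarity normalized by the two degrees.
    agg = [[sim[i][j] / math.sqrt(deg[i] * deg[j])
            if deg[i] and deg[j] else 0.0
            for j in range(n)] for i in range(n)]
    return assoc, deg, agg

sim = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.6],
       [0.2, 0.6, 1.0]]
assoc, deg, agg = aggregation_matrix(sim, threshold=0.5)
```

Normalizing by degrees keeps highly connected documents from dominating the subsequent partitioning step.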
4. The document search model training method according to claim 2, wherein the dividing the plurality of documents for training into a preset number of document sets according to the aggregation degree matrix includes:
constructing an objective function, wherein the objective function is used for calculating aggregation degree loss after a plurality of documents are divided into a preset number of document sets, and the aggregation degree loss is used for representing the aggregation degree lost after the associated documents are divided into different document sets;
and solving for the document sets that minimize the function value of the objective function, to obtain the divided preset number of document sets.
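The objective of claim 4 can be sketched directly for a toy instance: the loss is the aggregation degree summed over document pairs that land in different sets, and the partition minimizing it is found here by exhaustive search (a real system would use a relaxation such as spectral clustering or a graph cut). The matrix values are hypothetical.

```python
from itertools import product

def partition_loss(agg, labels):
    # Aggregation degree lost when associated documents are placed
    # in different document sets (the objective of claim 4).
    n = len(agg)
    return sum(agg[i][j] for i in range(n) for j in range(i + 1, n)
               if labels[i] != labels[j])

def best_partition(agg, k):
    # Exhaustive search over assignments into exactly k non-empty sets;
    # only feasible for toy sizes, but exact.
    n = len(agg)
    candidates = [lab for lab in product(range(k), repeat=n)
                  if len(set(lab)) == k]
    return list(min(candidates, key=lambda lab: partition_loss(agg, lab)))

agg = [[0.0, 0.9, 0.1, 0.0],
       [0.9, 0.0, 0.2, 0.1],
       [0.1, 0.2, 0.0, 0.8],
       [0.0, 0.1, 0.8, 0.0]]
labels = best_partition(agg, k=2)  # groups documents 0,1 and 2,3
```

The optimal cut here separates the two tightly linked pairs, losing only the weak cross-pair aggregation degree of 0.4.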
5. The document search model training method according to claim 4, further comprising, after solving for the document set that minimizes the function value of the objective function and obtaining the divided preset number of document sets:
counting the number of the documents in each document set;
judging whether there is a target document set in which the number of documents is larger than a document number threshold;
if so, dividing the documents in the target document set according to the aggregation degree matrix to obtain a sub-document set of the target document set, wherein the number of the documents included in the sub-document set is smaller than the document number threshold.
6. The document search model training method of claim 2, wherein the document sets have a plurality of hierarchy levels, and the determining the document index of the document based on the position of the document in the document set comprises:
determining a hierarchy of a document set in which the document is located, a hierarchy position of the document set in the hierarchy, and a document position of the document in the document set;
and encoding the hierarchy, the hierarchy position and the document position to obtain a document index of the document.
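A minimal sketch of claim 6's index encoding follows. The claim only requires that the hierarchy level, the set's position within that level, and the document's position within the set be encoded together; the hyphen-separated string format below is an illustrative assumption.

```python
def encode_index(level, set_pos, doc_pos):
    # Encode (hierarchy level, set position in level, document
    # position in set) into a single index string. The separator
    # scheme is illustrative, not mandated by the claim.
    return f"{level}-{set_pos}-{doc_pos}"

def decode_index(index):
    # Recover the three components from an encoded index.
    level, set_pos, doc_pos = (int(x) for x in index.split("-"))
    return level, set_pos, doc_pos

idx = encode_index(2, 5, 13)  # document 13 of set 5 at level 2
```

Because the encoding is invertible, the search stage can map a predicted index back to a unique document location in the hierarchy.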
7. The document search model training method according to claim 1, wherein the determining a target sentence group from the plurality of sentence groups includes:
for each sentence group, performing word segmentation on each sentence in the sentence group to obtain the word segments of each sentence;
calculating the matching degree between each target word segment of each sentence in the sentence group and the word segments of the other sentences;
incrementing a counter of the target word segment by 1 when the matching degree is greater than a preset matching degree threshold;
after the matching degrees between the target word segment and the word segments of the other sentences have been calculated, calculating the ratio of the counter value to the total number of word segments in the sentence group as the score of the target word segment;
and calculating the sum of the scores of all the word segments in the sentence group as the score of the sentence group;
and determining the sentence group with the highest score as a target sentence group.
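The scoring procedure of claim 7 can be sketched as follows. The claim leaves the matching-degree measure open; exact word equality is used here as a stand-in (an embedding similarity would be a natural substitute), and whitespace splitting stands in for proper word segmentation. The example groups are hypothetical.

```python
def word_match(w1, w2):
    # Stand-in matching degree: 1.0 for an exact match, else 0.0.
    return 1.0 if w1 == w2 else 0.0

def group_score(group, match_threshold=0.5):
    # group: list of sentences. Whitespace split stands in for
    # real word segmentation.
    sentences = [s.split() for s in group]
    total_words = sum(len(s) for s in sentences)
    score = 0.0
    for i, sent in enumerate(sentences):
        # Word segments of all the *other* sentences in the group.
        others = [w for j, s in enumerate(sentences) if j != i for w in s]
        for target in sent:
            # Counter: matches above the threshold (claim 7).
            counter = sum(1 for w in others
                          if word_match(target, w) > match_threshold)
            # Word score = counter / total words; summed into group score.
            score += counter / total_words
    return score

groups = [["machine learning model", "learning model training"],
          ["the weather today", "stock market news"]]
best = max(groups, key=group_score)  # the target sentence group
```

The group whose sentences share the most word segments, i.e. the most internally coherent group, wins, matching the claim's "highest semantic relevance" criterion under this stand-in measure.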
8. The document search model training method of claim 1, wherein vectorizing the sentences in the target sentence group comprises:
and inputting sentences in the target sentence group into a pre-trained keyword vector generation model for a plurality of times to obtain a plurality of keyword vectors.
9. The document search model training method according to any one of claims 1 to 8, wherein said constructing training samples using the keyword vector of the document and the document index comprises:
constructing a positive sample by adopting a keyword vector and a document index of the same document;
and constructing a negative sample by using the keyword vectors and the document indexes of different documents.
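A minimal sketch of claim 9's sample construction, assuming each training record is a (keyword vector, document index) pair; the vectors, index strings, and the choice of one random negative per document are illustrative assumptions.

```python
import random

def build_samples(doc_records):
    # doc_records: list of (keyword_vector, document_index) pairs.
    # Positive samples pair a document's keyword vector with its own
    # index (label 1); negatives pair it with another document's
    # index (label 0), per claim 9.
    positives, negatives = [], []
    for i, (vec, idx) in enumerate(doc_records):
        positives.append((vec, idx, 1))
        # Pick a different document's index for the negative sample.
        j = random.choice([k for k in range(len(doc_records)) if k != i])
        negatives.append((vec, doc_records[j][1], 0))
    return positives, negatives

records = [([0.1, 0.9], "0-0-0"),
           ([0.8, 0.2], "0-0-1"),
           ([0.4, 0.5], "0-1-0")]
pos, neg = build_samples(records)
```

Harder-to-distinguish negatives (e.g. indexes of semantically similar documents) would typically improve the trained model, but the claim does not require any particular negative-sampling strategy.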
10. The document search model training method according to any one of claims 1 to 8, wherein the training of the document search model using the training sample includes:
initializing a document search model;
randomly extracting a sample and inputting the sample into a document searching model to obtain a predicted document index;
calculating a loss value by adopting a predicted document index and a document index in a training sample;
judging whether the loss value is smaller than a preset loss threshold value or not;
if yes, stopping training the document searching model to obtain a trained document searching model;
if not, adjusting the model parameters of the document search model according to the loss value, and returning to the step of randomly extracting a sample and inputting it into the document search model to obtain a predicted document index.
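The control flow of claim 10 can be sketched with a deliberately tiny stand-in model: a one-parameter linear predictor trained by gradient descent on squared loss. The real document search model would be a neural network mapping keyword vectors to document indexes; only the loop structure (initialize, sample, predict, compute loss, check threshold, update) comes from the claim, and the learning rate, threshold, and samples are assumptions.

```python
import random

def train(samples, lr=0.1, loss_threshold=1e-4, max_steps=10000):
    w = 0.0                                # initialize the model (step 1)
    for _ in range(max_steps):
        x, y = random.choice(samples)      # randomly extract a sample
        pred = w * x                       # predicted "document index"
        loss = (pred - y) ** 2             # loss vs. the true index
        if loss < loss_threshold:          # below threshold: stop training
            break
        w -= lr * 2 * (pred - y) * x       # adjust parameters and loop
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(samples)  # converges near w = 2, since index = 2 * feature
```

A production loop would check the threshold against a running-average or validation loss rather than a single sample's loss, but the claim's stopping rule is stated per loss value.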
11. A document searching method, comprising:
receiving search information input on a user terminal;
inputting the search information into a document search model to obtain a document index;
searching a target document matched with the document index in a preset document database;
transmitting the target document to the user terminal;
wherein the document search model is trained by the document search model training method of any one of claims 1 to 10.
12. The document searching method according to claim 11, wherein the document database is established by:
vectorizing a plurality of documents to obtain a document vector of each document;
constructing an aggregation degree matrix based on the document vector, wherein each element in the aggregation degree matrix is the similarity between two documents;
dividing a plurality of documents into a preset number of document sets according to the aggregation degree matrix;
and determining a document index of the document based on the position of the document in the document set.
13. A document search model training apparatus, comprising:
the document index construction module is used for constructing a document index of the training document;
the document sentence group extraction module is used for extracting a plurality of sentence groups from each document, wherein each sentence group comprises at least one sentence;
the keyword vector acquisition module is used for determining a target sentence group from the plurality of sentence groups, and vectorizing sentences in the target sentence group to serve as keyword vectors of search keywords of the documents, wherein the target sentence group is the sentence group whose sentences have the highest semantic relevance to one another;
the training sample construction module is used for constructing a training sample by adopting the keyword vector of the document and the document index;
and the training module is used for training the document search model by adopting the training sample, wherein the document search model outputs a document index when the search keyword is input.
14. A document searching apparatus, characterized by comprising:
the search information receiving module is used for receiving search information input on the user terminal;
the document index prediction module is used for inputting the search information into a document search model to obtain a document index;
the target document searching module is used for searching a target document matched with the document index in a preset document database;
a target document sending module, configured to send the target document to the user terminal;
wherein the document search model is trained by the document search model training method of any one of claims 1 to 10.
15. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the document search model training method of any one of claims 1-10 and/or the document search method of any one of claims 11-12.
16. A computer readable storage medium storing computer instructions for causing a processor to perform the document search model training method of any one of claims 1-10 and/or the document search method of any one of claims 11-12 when executed.
CN202311260327.8A 2023-09-27 2023-09-27 Document search model training method, document search method and related device Active CN117312513B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311260327.8A CN117312513B (en) 2023-09-27 2023-09-27 Document search model training method, document search method and related device

Publications (2)

Publication Number Publication Date
CN117312513A true CN117312513A (en) 2023-12-29
CN117312513B CN117312513B (en) 2024-06-14

Family

ID=89242046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311260327.8A Active CN117312513B (en) 2023-09-27 2023-09-27 Document search model training method, document search method and related device

Country Status (1)

Country Link
CN (1) CN117312513B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118152575A (en) * 2024-05-09 2024-06-07 中电云计算技术有限公司 Event classification method and related device based on recall ordering

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160092427A1 (en) * 2014-09-30 2016-03-31 Accenture Global Services Limited Language Identification
CN107491518A (en) * 2017-08-15 2017-12-19 北京百度网讯科技有限公司 Method and apparatus, server, storage medium are recalled in one kind search
US20190057159A1 (en) * 2017-08-15 2019-02-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, server, and storage medium for recalling for search
CN113010771A (en) * 2021-02-19 2021-06-22 腾讯科技(深圳)有限公司 Training method and device for personalized semantic vector model in search engine
CN113934830A (en) * 2021-10-19 2022-01-14 平安国际智慧城市科技股份有限公司 Text retrieval model training, question and answer retrieval method, device, equipment and medium
CN116186243A (en) * 2023-01-03 2023-05-30 华润数字科技有限公司 Text abstract generation method, device, equipment and storage medium
CN115809312A (en) * 2023-02-02 2023-03-17 量子数科科技有限公司 Search recall method based on multi-channel recall

Also Published As

Publication number Publication date
CN117312513B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN112560501B (en) Semantic feature generation method, model training method, device, equipment and medium
WO2020082560A1 (en) Method, apparatus and device for extracting text keyword, as well as computer readable storage medium
US20230130006A1 (en) Method of processing video, method of quering video, and method of training model
CN117312513B (en) Document search model training method, document search method and related device
CN112580328A (en) Event information extraction method and device, storage medium and electronic equipment
CN111914564B (en) Text keyword determination method and device
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
CN113435208B (en) Training method and device for student model and electronic equipment
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN110727769A (en) Corpus generation method and device, and man-machine interaction processing method and device
CN112085091A (en) Artificial intelligence-based short text matching method, device, equipment and storage medium
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
CN112948573B (en) Text label extraction method, device, equipment and computer storage medium
CN115658903B (en) Text classification method, model training method, related device and electronic equipment
CN116662484A (en) Text regularization method, device, equipment and storage medium
CN115577082A (en) Document keyword extraction method and device, electronic equipment and storage medium
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN116795947A (en) Document recommendation method, device, electronic equipment and computer readable storage medium
CN114282049A (en) Video retrieval method, device, equipment and storage medium
CN113792546A (en) Corpus construction method, apparatus, device and storage medium
CN113033205A (en) Entity linking method, device, equipment and storage medium
CN110968668A (en) Method and device for calculating similarity of network public sentiment subjects based on hyper-network
CN116167455B (en) Model training and data deduplication method, device, equipment and storage medium
CN113971216B (en) Data processing method and device, electronic equipment and memory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant