CN116842138A - Document-based retrieval method, device, equipment and storage medium - Google Patents
- Publication number
- CN116842138A CN202310906497.2A
- Authority
- CN
- China
- Prior art keywords
- document
- text
- segment
- target
- target text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The application relates to a document-based retrieval method, device, equipment and storage medium. The method comprises: acquiring a document to be retrieved and extracting a plurality of keywords from it; filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved; generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment; retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments; calculating a target score for each candidate document in the candidate document set; sorting the candidate documents according to the target scores; and taking the sorting result as the retrieval result of the document to be retrieved. The application can improve the comprehensiveness of retrieval.
Description
Technical Field
The present application relates to the field of document retrieval technologies, and in particular, to a document-based retrieval method, apparatus, device, and storage medium.
Background
At present, document retrieval is usually performed by searching a retrieval database with the keywords or the abstract text segment of the document to be retrieved. Because keywords and abstract text segments generally do not cover the expansion scenes or transformation scenes of the document's core content, the retrieval results are incomplete.
Therefore, how to improve the comprehensiveness of retrieval results has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present application provides a document-based retrieval method, device, apparatus and storage medium, and aims to solve the above technical problems.
In a first aspect, the present application provides a document-based retrieval method, the method comprising:
acquiring a document to be retrieved, and extracting a plurality of keywords from the document to be retrieved;
filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments, according to each target text segment and its corresponding synonymous text segment;
and calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
Preferably, filling the keywords into the preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segments according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
Preferably, generating the synonymous text segment corresponding to each target text segment according to the word segments in each target text segment includes:
performing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
performing a masking operation on each word segment of the target text segment in turn to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and performing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
Preferably, retrieving the candidate document set associated with the target text segments and the synonymous text segments from the preset database, according to each target text segment and its corresponding synonymous text segment, includes:
retrieving a first document set associated with each target text segment from the preset database;
retrieving a second document set associated with each synonymous text segment from the preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
Preferably, calculating the target score of each candidate document in the candidate document set includes:
calculating the word set similarity between the key tag words of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as a first score of the candidate document;
calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as a second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and preset weights.
Preferably, calculating the third score of the candidate document according to the search information of the candidate document includes calculating the third score using a formula in which S(n) represents the third score of the n-th candidate document, T_0 represents the initial search weight, alpha is a preset cooling coefficient, D_n represents the time at which the n-th candidate document was last searched before the current time, and D_0 represents the current time.
Preferably, calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as the second score of the candidate document, includes:
performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and concatenating the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and concatenating the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment as the second score of the candidate document.
In a second aspect, the present application provides a document-based retrieval apparatus comprising:
an acquisition module, configured to acquire a document to be retrieved and extract a plurality of keywords from the document to be retrieved;
a filling module, configured to fill the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
a generation module, configured to generate a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
a retrieval module, configured to retrieve, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments, according to each target text segment and its corresponding synonymous text segment;
a calculation module, configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and a processor, configured to implement the steps of the document-based retrieval method according to any one of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, implements the steps of the document based retrieval method according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
According to the method, a plurality of keywords are extracted from the document to be retrieved and filled into the preset template text segment set to obtain at least one target text segment of the document to be retrieved. Combining the extracted keywords with the preset template text segment set yields target text segments that characterize more expansion scenes and transformation scenes. A synonymous text segment corresponding to each target text segment is then generated according to the word segments in that target text segment, which further enlarges the retrieval range of the document to be retrieved. A candidate document set associated with the target text segments and the synonymous text segments is retrieved from the preset database, a target score is calculated for each candidate document in the candidate document set, the candidate documents are sorted according to the target scores, and the sorting result is taken as the retrieval result of the document to be retrieved. Because the target text segments and the synonymous text segments represent more expansion scenes and transformation scenes, the comprehensiveness of the retrieval is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic view of an application environment of a document-based retrieval method of the present application;
FIG. 2 is a flow chart of calculating target scores of candidate documents according to the present application;
FIG. 3 is a flow chart illustrating a method for calculating a second score of a candidate document according to the present application;
FIG. 4 is a schematic block diagram of a preferred embodiment of a document-based search apparatus according to the present application;
FIG. 5 is a schematic diagram of an electronic device according to a preferred embodiment of the present application;
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings and in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the description of "first", "second", etc. in this disclosure is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implying an indication of the number of technical features being indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present application.
Referring to FIG. 1, a method flow diagram of an embodiment of a document-based retrieval method of the present application is shown. The method may be performed by an electronic device, which may be implemented in software and/or hardware, for example, the electronic device may be a data center server or a cloud server, or may be a terminal device. The document-based retrieval method comprises the following steps:
step S1: acquiring a document to be retrieved, and extracting a plurality of keywords from the document to be retrieved;
step S2: filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
step S3: generating a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
step S4: retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments, according to each target text segment and its corresponding synonymous text segment;
step S5: calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
In this embodiment, the user may upload the document to be searched in the search interface, or may input the document to be searched from the search interface, where the format of the document to be searched may be a word document, a notepad document, a PDF document, or the like.
After the document to be retrieved is acquired, a plurality of keywords are extracted from it. The keywords can be identified by a pre-trained semantic analysis model, which may be obtained by training an HMM model: the document to be retrieved is input into the semantic analysis model, and the model outputs the semantic keywords of the document.
A maximum forward matching algorithm or a maximum reverse matching algorithm can also be used to perform a word segmentation operation on the document to be retrieved. The number of occurrences of each word segment in the document is counted, the IDF (inverse document frequency) value of each word is calculated, and the TF (term frequency) value of each word in the document is then calculated, where TF = (number of occurrences of the word in the document) / (sum of the numbers of occurrences of all words in the document). The TF value is multiplied by the IDF value to obtain the TF-IDF value of each word. The larger the TF-IDF value, the more important the word is to the document to be retrieved and the higher its priority as a keyword. The words with the highest TF-IDF values can therefore be used as keywords of the document to be retrieved; for example, the words with the 10 highest TF-IDF values may be selected.
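As a minimal sketch of the TF-IDF selection described above, assuming the documents are already segmented into word lists, the ranking might be written as follows. The TF definition follows the text; the smoothed IDF formula, the function name `top_keywords` and the toy corpus are assumptions for illustration, since the text does not specify how the IDF value is computed.

```python
import math
from collections import Counter


def top_keywords(doc_tokens, corpus_tokens, k=10):
    """Return the k words of doc_tokens with the highest TF-IDF value."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    n_docs = len(corpus_tokens)
    scores = {}
    for term, count in counts.items():
        # TF as defined in the text: occurrences of the word / occurrences of all words.
        tf = count / total
        # Number of corpus documents that contain the word.
        df = sum(1 for doc in corpus_tokens if term in doc)
        # Smoothed IDF; this particular formula is an assumption.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical, already segmented corpus and document.
corpus = [
    ["image", "recognition", "feature", "extraction"],
    ["image", "storage", "format"],
    ["text", "retrieval", "index"],
]
doc = ["feature", "extraction", "feature", "image"]
print(top_keywords(doc, corpus, k=2))  # ['feature', 'extraction']
```

Here "image" ranks last because it occurs in two of the three corpus documents, which lowers its IDF value.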
Compared with keywords, a text segment can represent the core content of the document to be retrieved more accurately. However, the abstract text segment of the document usually describes the core content only briefly and generally lacks the expansion scenes and transformation scenes of the core content. The extracted keywords are therefore combined with a preset template text segment set to generate target text segments that represent more expansion scenes and transformation scenes. The preset template text segment set comprises a plurality of template text segments, which can be preconfigured according to actual scene requirements. Filling the keywords into the template text segments yields target text segments for a plurality of scenes corresponding to the document to be retrieved, including its expansion scenes and transformation scenes. Specifically, filling the keywords into the preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segments according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
The part of speech of each keyword, such as common noun, locative noun, common verb, adjective or pronoun, can be identified by the lexical analysis model. According to the part of speech of each keyword, the keywords of the document to be retrieved are filled into the blank positions of the preset template text segments to obtain a plurality of initial text segments, each of which is generated by filling a template text segment with keywords. Because an initial text segment may contain semantic logic errors or grammatical errors after the keywords are filled in, the initial text segments with logic errors or grammatical errors are deleted, and the remaining initial text segments are used as the target text segments.
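The filling step can be illustrated with a short sketch, assuming each blank in a template text segment is labelled with the part of speech it accepts. The `{noun}`/`{verb}` placeholder syntax, the function name `fill_templates` and the `is_valid` callback (a stand-in for the grammar and logic check) are hypothetical and not taken from the application.

```python
import re
from itertools import product

# A blank such as "{noun}" names the part of speech it accepts (assumed syntax).
SLOT = re.compile(r"\{(\w+)\}")


def fill_templates(keywords_by_pos, templates, is_valid=lambda segment: True):
    """Fill every template with every compatible keyword combination."""
    targets = []
    for template in templates:
        slots = SLOT.findall(template)
        choices = [keywords_by_pos.get(pos, []) for pos in slots]
        for combo in product(*choices):
            if len(set(combo)) < len(combo):
                continue  # assumption: one keyword is not used in two blanks
            words = iter(combo)
            # Replace the blanks left to right with the chosen keywords.
            segment = SLOT.sub(lambda match: next(words), template)
            if is_valid(segment):  # stand-in for the grammar/logic filter
                targets.append(segment)
    return targets


# Hypothetical keywords grouped by part of speech, and hypothetical templates.
keywords = {"noun": ["retrieval", "document"], "verb": ["rank"]}
templates = ["{noun} of the {noun}", "how to {verb} a {noun}"]
for segment in fill_templates(keywords, templates):
    print(segment)
```

Passing a stricter `is_valid` function would drop initial text segments such as "document of the retrieval", which corresponds to the deletion of erroneous segments described above.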
To further expand the retrieval range of the document to be retrieved, after the target text segments representing the expansion scenes and transformation scenes of the document are obtained, a synonymous text segment corresponding to each target text segment is generated according to the word segments in that target text segment, thereby expanding the text segments associated with the target text segment. The synonymous text segment can be obtained by matching against a preset text segment library according to semantic similarity. In one embodiment, generating the synonymous text segment corresponding to each target text segment according to the word segments in each target text segment includes:
performing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
performing a masking operation on each word segment of the target text segment in turn to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and performing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
Assume that the target text segment is the single sentence "The key to image recognition is feature extraction". Performing the word segmentation operation on this sentence yields word segments including "image", "recognition", "key", "is", "feature" and "extraction".
A masking operation is performed on each word segment of the target text segment in turn to obtain a plurality of masked text segments corresponding to the target text segment. The masked text segments obtained include:
The key to [MASK] recognition is feature extraction
The key to image [MASK] is feature extraction
The key [MASK] image recognition is feature extraction
The [MASK] to image recognition is feature extraction
The key to image recognition [MASK] feature extraction
The key to image recognition is [MASK] extraction
The key to image recognition is feature [MASK]
Each masked text segment is input into the BERT model to obtain a plurality of synonyms for the masked position of that text segment. For example, the synonyms of "image" may include "picture" and "portrait". The word segments in the target text segment are replaced with the obtained synonyms to obtain the synonymous text segments corresponding to the target text segment, for example "The key to picture recognition is feature extraction" and "The key to portrait recognition is feature extraction". Because the target text segments include the expansion scenes and transformation scenes of the document to be retrieved, the corresponding synonymous text segments can further expand the scenes of the document to be retrieved.
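The masking and replacement steps can be sketched as follows. This is an illustration only: the prediction of the BERT model is replaced by a hard-coded dictionary of candidate words per position, and the function names `masked_variants` and `synonym_segments` are hypothetical.

```python
MASK = "[MASK]"


def masked_variants(tokens):
    """Mask each word segment in turn, producing one masked segment per position."""
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]


def synonym_segments(tokens, candidates_by_position):
    """Replace a word segment with each candidate proposed for its position.

    candidates_by_position maps a token index to the words a masked language
    model would propose there; in this sketch it is supplied by hand.
    """
    segments = []
    for position, candidates in sorted(candidates_by_position.items()):
        for word in candidates:
            if word != tokens[position]:  # skip the original word itself
                segments.append(tokens[:position] + [word] + tokens[position + 1:])
    return segments


tokens = ["image", "recognition", "key", "is", "feature", "extraction"]
for variant in masked_variants(tokens):
    print(" ".join(variant))

# Hand-written stand-in for the model output at position 0 ("image").
predicted = {0: ["picture", "portrait", "image"]}
for segment in synonym_segments(tokens, predicted):
    print(" ".join(segment))
```

In a real system the `predicted` dictionary would be filled from the model's top predictions for each masked position, possibly filtered so that only true synonyms are kept.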
After the synonymous text segments corresponding to the target text segments are obtained, a candidate document set associated with the target text segments and the synonymous text segments is retrieved from a preset database according to each target text segment and synonymous text segment. The preset database can be a local database or a third-party database. Because both the target text segments and the synonymous text segments represent the expansion scenes and transformation scenes of the document to be retrieved, the candidate document set retrieved with them is more closely associated with the document to be retrieved. For example, the candidate document set may be retrieved from the database using the semantic information of the target text segments and the synonymous text segments, or using their keyword information. Specifically, retrieving the candidate document set associated with the target text segments and the synonymous text segments from the preset database, according to each target text segment and its corresponding synonymous text segment, includes:
retrieving a first document set associated with each target text segment from the preset database;
retrieving a second document set associated with each synonymous text segment from the preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
The documents associated with each target segment are retrieved from the database and noted as a first set of documents. The documents associated with each synonymous segment are retrieved from the database and noted as a second set of documents. Since there may be duplicate documents in the first document set and the second document set, it is necessary to perform a deduplication operation of the documents, thereby obtaining a candidate document set.
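The deduplication can be sketched in a few lines, assuming each retrieved document is identified by a unique ID (the IDs and the function name `merge_candidates` are hypothetical):

```python
def merge_candidates(first_set, second_set):
    """Merge two result lists, keeping only the first occurrence of each document ID."""
    seen = set()
    merged = []
    for doc_id in list(first_set) + list(second_set):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


first = ["d1", "d2"]   # retrieved with the target text segments
second = ["d2", "d3"]  # retrieved with the synonymous text segments
print(merge_candidates(first, second))  # ['d1', 'd2', 'd3']
```

Keeping insertion order, rather than using a plain set union, makes the result reproducible before the scoring step reorders it.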
After the candidate document set is obtained, a target score is calculated for each candidate document in the set. The target score characterizes the relevance between the candidate document and the document to be retrieved: the higher the target score, the higher the relevance. The candidate documents are sorted according to their target scores, and the sorting result is taken as the retrieval result of the document to be retrieved. Referring to FIG. 2, which is a flow chart of calculating the target score of a candidate document according to the present application, the process of calculating the target score includes:
step S51: calculating the word set similarity between the key tag words of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as a first score of the candidate document;
step S52: calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as a second score of the candidate document;
step S53: calculating a third score of the candidate document according to the search information of the candidate document;
step S54: calculating the target score of the candidate document according to the first score, the second score, the third score and preset weights.
The key tag words of a candidate document can be obtained directly from the database. Each document in the database has corresponding key tag words, which can be used to classify the documents in the database to facilitate retrieval. The key tag words of each candidate document are taken as one word set and the keywords of the document to be retrieved as another word set; the word set similarity between the two is calculated and used as the first score of the candidate document.
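The text does not name a specific word set similarity measure. As one possible choice, the Jaccard coefficient (size of the intersection divided by size of the union) could be used; the sketch below makes that assumption, and the function name and sample words are hypothetical.

```python
def word_set_similarity(tag_words, keywords):
    """Jaccard similarity between two word collections, in the range [0, 1]."""
    a, b = set(tag_words), set(keywords)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


tags = ["image", "recognition", "feature"]     # key tag words of a candidate document
keywords = ["image", "feature", "retrieval"]   # keywords of the document to be retrieved
print(word_set_similarity(tags, keywords))     # 0.5
```

Two of the four distinct words are shared, so the first score of this candidate would be 0.5 under this measure.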
The abstract text segment of a candidate document can also be obtained directly from the database. Each document in the database has a corresponding abstract text segment, through which a user can quickly understand the core content of the document. The text segment similarity between the abstract text segment of each candidate document and the target text segment of the document to be retrieved is calculated and used as the second score of the candidate document. Referring to FIG. 3, which is a schematic flow chart of calculating the second score of a candidate document according to the present application, calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as the second score of the candidate document, includes:
step S521: performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
step S522: converting the clauses of the abstract text segment into a first sentence vector set, and concatenating the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
step S523: converting the clauses of the target text segment into a second sentence vector set, and concatenating the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
step S524: calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment as the second score of the candidate document.
The abstract text segment and the target text segment are each split into clauses. The clauses of the abstract text segment are converted into a plurality of sentence vectors using a word2vec model; these are recorded as the first sentence vector set, and the sentence vectors of the first sentence vector set are concatenated to obtain the segment vector of the abstract text segment. The clauses of the target text segment are likewise converted into a plurality of sentence vectors, recorded as the second sentence vector set, and concatenated to obtain the segment vector of the target text segment. The similarity between the segment vector of the abstract text segment and the segment vector of the target text segment is calculated to obtain the second score of the candidate document.
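A rough sketch of steps S521 to S524 follows. Several parts are assumptions made only so that the example runs without a trained model: `toy_sentence_vector` is a hashed bag-of-words stand-in for the word2vec-based sentence vectors, cosine similarity is used as the similarity measure (the text does not name one), and the shorter segment vector is zero-padded because concatenation yields vectors of different lengths when the two text segments have different numbers of clauses.

```python
import math
import re


def split_clauses(segment):
    """Split a text segment on sentence-ending punctuation and semicolons."""
    return [c.strip() for c in re.split(r"[.;!?。；！？]", segment) if c.strip()]


def toy_sentence_vector(clause, dim=8):
    """Hypothetical stand-in for a sentence encoder: hashed bag of words."""
    vec = [0.0] * dim
    for word in clause.lower().split():
        vec[sum(ord(ch) for ch in word) % dim] += 1.0
    return vec


def segment_vector(segment, dim=8):
    """Concatenate the sentence vectors of all clauses of a text segment."""
    vec = []
    for clause in split_clauses(segment):
        vec.extend(toy_sentence_vector(clause, dim))
    return vec


def cosine(u, v):
    """Cosine similarity; the shorter vector is zero-padded (an assumption)."""
    n = max(len(u), len(v))
    u = u + [0.0] * (n - len(u))
    v = v + [0.0] * (n - len(v))
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def second_score(abstract_segment, target_segment):
    return cosine(segment_vector(abstract_segment), segment_vector(target_segment))


abstract = "Feature extraction is the key. It feeds the classifier."
target = "The key to image recognition is feature extraction."
print(round(second_score(abstract, target), 3))
```

With a real sentence encoder only `toy_sentence_vector` would need to change; the splitting, concatenation and similarity steps stay the same.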
Search information of each candidate document is acquired, for example the number of searches, the number of views, and the search date. A third score is calculated for each candidate document from its search information, using a formula comprising:
wherein S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the time at which the nth candidate document was last searched, and D_0 represents the current time. The time difference may be measured in hours: for example, if the document was last searched at 10:09 and the current time is 17:30, the time difference is 7 hours. In this way, documents that other users have searched recently obtain a higher third score, while the third score of documents that are not searched decreases over time.
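The formula itself is not reproduced in this text, so the sketch below does not claim to be it. It only illustrates the stated behaviour (a score that starts from an initial weight and decays with the hours elapsed since the last search) using an exponential cooling form, which is an assumption, as are the function names and the sample values of `t0` and `alpha`.

```python
import math
from datetime import datetime

def elapsed_hours(last_searched, now):
    """Whole hours between the last search (D_n) and the current time (D_0)."""
    seconds = (now - last_searched).total_seconds()
    return max(0, int(seconds // 3600))

def third_score(t0, alpha, last_searched, now):
    """Heat-style score that decays with time since the last search.

    ASSUMPTION: exponential cooling, t0 * exp(-alpha * hours). The original
    formula is not available in the text; this form is illustrative only.
    """
    return t0 * math.exp(-alpha * elapsed_hours(last_searched, now))

now = datetime(2023, 7, 24, 17, 30)
recent = third_score(1.0, 0.1, datetime(2023, 7, 24, 10, 9), now)
stale = third_score(1.0, 0.1, datetime(2023, 7, 20, 10, 9), now)
```

With these inputs the document searched earlier the same day scores higher than the one last searched four days before, which matches the behaviour described in the text.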
The target score of the candidate document is calculated from the first score, the second score, the third score and preset weights, where the weight of the first score is 0.4, the weight of the second score is 0.4, and the weight of the third score is 0.2.
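As a small worked illustration of the weighting, the sketch below combines the three scores as a weighted sum with the weights 0.4, 0.4 and 0.2 given in the text, then sorts candidates by the result. Treating the combination as a plain weighted sum and sorting in descending order are assumptions, and the function and variable names are hypothetical.

```python
WEIGHTS = (0.4, 0.4, 0.2)  # weights of the first, second and third score

def target_score(first, second, third, weights=WEIGHTS):
    """Weighted sum of the three partial scores (assumed combination rule)."""
    w1, w2, w3 = weights
    return w1 * first + w2 * second + w3 * third

def rank_candidates(scores_by_doc):
    """Sort document ids by target score, highest first.

    scores_by_doc maps a document id to its (first, second, third) scores.
    """
    return sorted(
        scores_by_doc,
        key=lambda doc_id: target_score(*scores_by_doc[doc_id]),
        reverse=True,
    )

ranking = rank_candidates({
    "doc_a": (0.2, 0.9, 0.1),
    "doc_b": (0.8, 0.7, 0.5),
    "doc_c": (0.1, 0.1, 1.0),
})
```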
Referring to fig. 4, a functional block diagram of a document-based retrieval apparatus 100 according to the present application is shown.
The document-based retrieval apparatus 100 of the present application may be installed in an electronic device. Depending on the functions implemented, the document-based retrieval apparatus 100 may include an acquisition module 110, a filling module 120, a generation module 130, a retrieval module 140, and a calculation module 150. A module in the present application, which may also be referred to as a unit, refers to a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the acquisition module 110: configured to acquire a document to be retrieved and extract a plurality of keywords of the document to be retrieved;
the filling module 120: configured to fill the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
the generation module 130: configured to generate a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
the retrieval module 140: configured to retrieve, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments, according to each target text segment and its corresponding synonymous text segment;
the calculation module 150: configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
In one embodiment, filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segment set according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
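The template-filling steps can be sketched as below. This is an illustrative sketch only: the slot syntax (`{noun1}`, `{verb}`, where trailing digits distinguish slots of the same part of speech), the part-of-speech tags, and the `is_grammatical` callback standing in for the grammar check are all assumptions, and the keywords are assumed to arrive already tagged by some lexical analysis model.

```python
import string
from itertools import product

def slots_of(template):
    """Return the slot names of a template such as '{noun1} based on {noun2}'."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]

def fill_templates(tagged_keywords, templates, is_grammatical=lambda text: True):
    """Fill keywords into template blanks by part of speech.

    tagged_keywords: list of (word, part_of_speech) pairs.
    is_grammatical: hypothetical grammar check; candidates it rejects are dropped.
    """
    by_pos = {}
    for word, pos in tagged_keywords:
        by_pos.setdefault(pos, []).append(word)

    results = []
    for template in templates:
        slots = slots_of(template)
        # A slot named 'noun2' accepts any keyword tagged 'noun'.
        choices = [by_pos.get(slot.rstrip("0123456789"), []) for slot in slots]
        for combo in product(*choices):
            if len(set(combo)) < len(combo):
                continue  # do not use the same keyword twice in one segment
            candidate = template.format(**dict(zip(slots, combo)))
            if is_grammatical(candidate):
                results.append(candidate)
    return results

keywords = [("retrieval", "noun"), ("document", "noun"), ("rank", "verb")]
templates = ["a method to {verb} {noun1}", "{noun1} based on {noun2}"]
segments = fill_templates(keywords, templates)
```

A template whose slots cannot be matched by any keyword simply produces no initial text segment.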
In one embodiment, generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment includes:
executing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially executing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and executing a replacement operation on the word segments in the target text segment based on the synonyms to obtain a synonymous text segment corresponding to the target text segment.
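The mask-and-replace procedure can be sketched as follows. The BERT model is not called here: `predict` is a hypothetical callback that stands in for it and returns candidate words for the masked position. Replacing one word segment at a time is also an assumption, since the text does not say whether several positions are replaced together.

```python
MASK = "[MASK]"

def masked_variants(tokens):
    """Mask each word segment in turn, giving one masked segment per position."""
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]

def synonymous_segments(tokens, predict):
    """Build synonymous segments by replacing one word segment at a time.

    predict(masked_tokens, position) is a stand-in for a masked language
    model such as BERT; it returns candidate words for the masked position.
    """
    results = []
    for position, masked in enumerate(masked_variants(tokens)):
        for candidate in predict(masked, position):
            if candidate != tokens[position]:  # skip the original word itself
                results.append(tokens[:position] + [candidate] + tokens[position + 1:])
    return results

# Toy predictor: a fixed lookup table instead of a real model.
table = {0: ["find", "search"], 2: ["papers"]}
tokens = ["search", "similar", "documents"]
variants = synonymous_segments(tokens, lambda masked, pos: table.get(pos, []))
```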
In one embodiment, the retrieving, according to each target segment and a synonymous segment corresponding to the target segment, a candidate document set associated with the target segment and the synonymous segment from a preset database includes:
retrieving a first document set associated with each target text segment from a preset database;
retrieving a second document set associated with each synonymous text segment from a preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
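One reading of the deduplication step is an order-preserving union of the two retrieved sets. The sketch below assumes each document is identified by a hashable id; the function name and the ids are hypothetical.

```python
def candidate_set(first_docs, second_docs):
    """Combine both retrieved sets, keeping each document id only once."""
    seen = set()
    merged = []
    for doc_id in list(first_docs) + list(second_docs):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged

candidates = candidate_set(["d1", "d2"], ["d2", "d3"])
```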
In one embodiment, the calculating the target score for each candidate document in the set of candidate documents includes:
calculating the word set similarity between the keyword tags of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as the first score of the candidate document;
calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and the preset weight.
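For the first score, the text names a word set similarity without fixing the measure. The sketch below uses the Jaccard coefficient (size of the intersection over size of the union), which is one common choice and is an assumption here, as is the function name.

```python
def word_set_similarity(tag_words, keywords):
    """Jaccard similarity between two word collections, treated as sets."""
    a, b = set(tag_words), set(keywords)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

first_score = word_set_similarity(
    ["retrieval", "ranking", "keyword"],   # keyword tags of a candidate document
    ["ranking", "keyword", "synonym"],     # keywords of the document to be retrieved
)
```

Here two of the four distinct words are shared, so the first score is 0.5.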
In one embodiment, the calculating the third score for the candidate document based on the search information for the candidate document includes calculating the third score using a formula including:
wherein S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the time at which the nth candidate document was last searched, and D_0 represents the current time.
In one embodiment, calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document, includes:
performing a sentence-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and concatenating the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and concatenating the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment, and taking it as the second score of the candidate document.
Referring to fig. 5, a schematic diagram of a preferred embodiment of an electronic device 1 according to the present application is shown.
The electronic device 1 includes, but is not limited to: memory 11, processor 12, display 13, and communication interface 14. The electronic device 1 may be connected to a network via a communication interface 14. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, or a call network.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or internal memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit of the electronic device 1 and an external storage device. In the present embodiment, the memory 11 is typically used for storing the operating system and the various computer programs installed on the electronic device 1, such as the program code of the document-based retrieval program 10. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is typically used for controlling the overall operation of the electronic device 1, e.g. performing data interaction or communication related control and processing, etc. In this embodiment, the processor 12 is configured to execute program codes or process data stored in the memory 11, such as program codes or the like of the document-based retrieval program 10.
The display 13 may be referred to as a display screen or a display unit. The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like in some embodiments. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface.
The communication interface 14 may optionally comprise a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 5 shows only the electronic device 1 with components 11-14 and the document based retrieval program 10, but it should be understood that not all shown components are required to be implemented, and that more or fewer components may alternatively be implemented.
In the above-described embodiment, the processor 12 may implement the following steps when executing the document-based retrieval program 10 stored in the memory 11:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the corresponding word segmentation in each target text segment;
according to each target text segment and the corresponding synonymous text segment of the target text segment, searching the target text segment and a candidate document set associated with the synonymous text segment from a preset database;
and calculating a target score of each candidate document in the candidate document set, sorting each candidate document according to the target scores, and taking the sorting result as a retrieval result of the document to be retrieved.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, refer to the description of the functional block diagram of the document-based retrieval apparatus 100 in FIG. 4 and of the flowchart of the document-based retrieval method in FIG. 1.
Furthermore, an embodiment of the application also provides a computer-readable storage medium, which may be nonvolatile or volatile. The computer-readable storage medium includes a data storage area and a program storage area; the program storage area stores a document-based retrieval program 10 which, when executed by a processor, performs the following operations:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the corresponding word segmentation in each target text segment;
according to each target text segment and the corresponding synonymous text segment of the target text segment, searching the target text segment and a candidate document set associated with the synonymous text segment from a preset database;
and calculating a target score of each candidate document in the candidate document set, sorting each candidate document according to the target scores, and taking the sorting result as a retrieval result of the document to be retrieved.
The embodiment of the computer-readable storage medium of the present application is substantially the same as the embodiment of the document-based retrieval method described above, and is not described again here.
It should be noted that the serial numbers of the foregoing embodiments of the present application are for description only and do not represent the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium as described above (e.g. ROM/RAM, magnetic disk, optical disk) and comprising instructions for causing a terminal device (which may be a mobile phone, a computer, an electronic device, a network device, or the like) to perform the methods according to the embodiments of the present application.
The foregoing describes only preferred embodiments of the present application and does not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.
Claims (10)
1. A document-based retrieval method, the method comprising:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the corresponding word segmentation in each target text segment;
according to each target text segment and the corresponding synonymous text segment of the target text segment, searching the target text segment and a candidate document set associated with the synonymous text segment from a preset database;
and calculating a target score of each candidate document in the candidate document set, sorting each candidate document according to the target scores, and taking the sorting result as a retrieval result of the document to be retrieved.
2. The document-based retrieval method according to claim 1, wherein the filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved comprises:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segment set according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
3. The document-based retrieval method according to claim 1, wherein the generating a synonymous text segment corresponding to each target text segment according to the corresponding word segments in each target text segment comprises:
executing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially executing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and executing a replacement operation on the word segments in the target text segment based on the synonyms to obtain a synonymous text segment corresponding to the target text segment.
4. The document-based retrieval method according to claim 1, wherein the retrieving, from a preset database, a candidate document set associated with the target text segment and the synonymous text segment according to each target text segment and the synonymous text segment corresponding to the target text segment comprises:
retrieving a first document set associated with each target text segment from the preset database;
retrieving a second document set associated with each synonymous text segment from the preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
5. The document-based retrieval method of claim 1, wherein said calculating a target score for each candidate document in the set of candidate documents comprises:
calculating the word set similarity between the keyword tags of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as the first score of the candidate document;
calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and the preset weight.
6. The document-based retrieval method of claim 5, wherein calculating a third score for the candidate document based on the search information for the candidate document includes calculating the third score using a formula comprising:
wherein S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the time at which the nth candidate document was last searched, and D_0 represents the current time.
7. The document-based retrieval method according to claim 5, wherein the calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document, comprises:
performing a sentence-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and concatenating the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and concatenating the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment, and taking it as the second score of the candidate document.
8. A document-based retrieval apparatus, the apparatus comprising:
an acquisition module, configured to acquire a document to be retrieved and extract a plurality of keywords of the document to be retrieved;
a filling module, configured to fill the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
a generation module, configured to generate a synonymous text segment corresponding to each target text segment according to the corresponding word segments in each target text segment;
a retrieval module, configured to retrieve, from a preset database, a candidate document set associated with the target text segment and the synonymous text segment, according to each target text segment and the synonymous text segment corresponding to the target text segment;
a calculation module, configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the document-based retrieval method according to any one of claims 1 to 7 when executing a program stored on a memory.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the document-based retrieval method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310906497.2A CN116842138A (en) | 2023-07-24 | 2023-07-24 | Document-based retrieval method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116842138A (en) | 2023-10-03 |
Family
ID=88170685
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310906497.2A Pending CN116842138A (en) | 2023-07-24 | 2023-07-24 | Document-based retrieval method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116842138A (en) |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180181988A1 (en) * | 2016-12-26 | 2018-06-28 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for pushing information |
CN110069763A (en) * | 2019-03-16 | 2019-07-30 | 平安科技(深圳)有限公司 | Contract text method for customizing, device, equipment and readable storage medium storing program for executing |
WO2021175005A1 (en) * | 2020-03-04 | 2021-09-10 | 深圳壹账通智能科技有限公司 | Vector-based document retrieval method and apparatus, computer device, and storage medium |
CN111753048A (en) * | 2020-05-21 | 2020-10-09 | 高新兴科技集团股份有限公司 | Document retrieval method, device, equipment and storage medium |
WO2021164255A1 (en) * | 2020-07-28 | 2021-08-26 | 平安科技(深圳)有限公司 | Presentation generation method and apparatus, computer device and storage medium |
CN112507109A (en) * | 2020-12-11 | 2021-03-16 | 重庆知识产权大数据研究院有限公司 | Retrieval method and device based on semantic analysis and keyword recognition |
CN112988969A (en) * | 2021-03-09 | 2021-06-18 | 北京百度网讯科技有限公司 | Method, device, equipment and storage medium for text retrieval |
CN113094519A (en) * | 2021-05-07 | 2021-07-09 | 超凡知识产权服务股份有限公司 | Method and device for searching based on document |
CN113377927A (en) * | 2021-06-28 | 2021-09-10 | 成都卫士通信息产业股份有限公司 | Similar document detection method and device, electronic equipment and storage medium |
CN113486169A (en) * | 2021-07-27 | 2021-10-08 | 平安国际智慧城市科技股份有限公司 | Synonymy statement generation method, device, equipment and storage medium based on BERT model |
CN113901173A (en) * | 2021-09-23 | 2022-01-07 | 深信服科技股份有限公司 | Retrieval method, retrieval device, electronic equipment and computer storage medium |
CN115309954A (en) * | 2022-08-30 | 2022-11-08 | 中信建投证券股份有限公司 | Data retrieval method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
张玲达; 金林; 程秀霞; 江飞: "A content-based hybrid-mode filtering model" (一种基于内容的混合模式过滤模型), Computer Engineering (计算机工程), no. 24 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110502621B (en) | Question answering method, question answering device, computer equipment and storage medium | |
US8868469B2 (en) | System and method for phrase identification | |
TWI536181B (en) | Language identification in multilingual text | |
US7386438B1 (en) | Identifying language attributes through probabilistic analysis | |
JP4408129B2 (en) | Image document processing apparatus, image document processing method, program, and recording medium | |
US8577882B2 (en) | Method and system for searching multilingual documents | |
CN107085583B (en) | Electronic document management method and device based on content | |
CN112395385B (en) | Text generation method and device based on artificial intelligence, computer equipment and medium | |
JP4570648B2 (en) | Image document processing apparatus, image document processing method, image document processing program, and recording medium | |
CN111625621B (en) | Document retrieval method and device, electronic equipment and storage medium | |
WO2020056977A1 (en) | Knowledge point pushing method and device, and computer readable storage medium | |
CN115438166A (en) | Keyword and semantic-based searching method, device, equipment and storage medium | |
CN113434636A (en) | Semantic-based approximate text search method and device, computer equipment and medium | |
US11520835B2 (en) | Learning system, learning method, and program | |
CN114880447A (en) | Information retrieval method, device, equipment and storage medium | |
CN111460099B (en) | Keyword extraction method, device and storage medium | |
CN110795942B (en) | Keyword determination method and device based on semantic recognition and storage medium | |
CN115794995A (en) | Target answer obtaining method and related device, electronic equipment and storage medium | |
US11379527B2 (en) | Sibling search queries | |
US11151317B1 (en) | Contextual spelling correction system | |
CN111401012A (en) | Text error correction method, electronic device and computer readable storage medium | |
CN112487159B (en) | Search method, search device, and computer-readable storage medium | |
CN113254588A (en) | Data searching method and system | |
CN115563515B (en) | Text similarity detection method, device, equipment and storage medium | |
JP2005107931A (en) | Image search apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||