CN116842138A - Document-based retrieval method, device, equipment and storage medium - Google Patents

Info

Publication number
CN116842138A
Authority
CN
China
Prior art keywords
document
text
segment
target
target text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310906497.2A
Other languages
Chinese (zh)
Inventor
储铭钧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Chenghu Information Technology Co ltd
Original Assignee
Shanghai Chenghu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Chenghu Information Technology Co ltd
Priority to CN202310906497.2A
Publication of CN116842138A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a document-based retrieval method, device, equipment and storage medium. The method comprises: obtaining a document to be retrieved and extracting a plurality of keywords from it; filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved; generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment; retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments; calculating a target score for each candidate document in the candidate document set; sorting the candidate documents according to the target scores; and taking the sorting result as the retrieval result of the document to be retrieved. The application can improve the comprehensiveness of retrieval.

Description

Document-based retrieval method, device, equipment and storage medium
Technical Field
The present application relates to the field of document retrieval technologies, and in particular, to a document-based retrieval method, apparatus, device, and storage medium.
Background
At present, document retrieval is usually carried out in a retrieval database using keywords or the abstract text segment of the document to be retrieved. Because keywords and abstract text segments generally do not cover expansion scenarios or transformation scenarios of the document's core content, the retrieval result is incomplete.
Therefore, how to improve the comprehensiveness of retrieval results has become a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present application provides a document-based retrieval method, device, apparatus and storage medium, and aims to solve the above technical problems.
In a first aspect, the present application provides a document-based retrieval method, the method comprising:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments according to each target text segment and its corresponding synonymous text segment;
and calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
Preferably, the filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segments according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
Preferably, the generating, according to the word segments in each target text segment, a synonymous text segment corresponding to each target text segment includes:
executing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially executing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model respectively to obtain a plurality of synonyms for the masked position in each masked text segment;
and executing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
Preferably, the retrieving, according to each target text segment and the synonymous text segment corresponding to the target text segment, the candidate document set associated with the target text segment and the synonymous text segment from a preset database includes:
retrieving a first document set associated with each target text segment from a preset database;
retrieving a second document set associated with each synonymous text segment from a preset database;
and merging the first document set and the second document set and deleting the repeated documents to obtain the candidate document set.
Preferably, the calculating a target score for each candidate document in the candidate document set includes:
calculating the word set similarity between the key tag words of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as a first score of the candidate document;
calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as a second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and preset weights.
Preferably, the calculating the third score of the candidate document according to the search information of the candidate document includes calculating the third score using a formula including:
wherein S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the time at which the nth candidate document was most recently searched, and D_0 represents the current time.
Preferably, the calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as the second score of the candidate document, includes:
performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain clauses of the abstract text segment and clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and splicing the sentence vectors of the first sentence vector set to obtain a segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and splicing the sentence vectors of the second sentence vector set to obtain a segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment as the second score of the candidate document.
In a second aspect, the present application provides a document-based retrieval apparatus comprising:
an acquisition module, configured to acquire a document to be retrieved and extract a plurality of keywords of the document to be retrieved;
a filling module, configured to fill the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
a generation module, configured to generate a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
a retrieval module, configured to retrieve, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments according to each target text segment and its corresponding synonymous text segment;
and a calculation module, configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
In a third aspect, the present application provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and a processor, configured to implement the steps of the document-based retrieval method according to any one of the embodiments of the first aspect when executing the program stored in the memory.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, implements the steps of the document based retrieval method according to any one of the embodiments of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages:
In the application, a plurality of keywords of the document to be retrieved are extracted and filled into a preset template text segment set to obtain at least one target text segment of the document to be retrieved. Combining the extracted keywords with the preset template text segment set produces target text segments that represent more expansion scenarios and transformation scenarios. Generating a synonymous text segment for each target text segment according to the word segments in that target text segment further enlarges the retrieval range of the document to be retrieved. A candidate document set associated with the target text segments and the synonymous text segments is then retrieved from a preset database, a target score is calculated for each candidate document in the candidate document set, the candidate documents are sorted according to the target scores, and the sorting result is taken as the retrieval result of the document to be retrieved. Because the target text segments and the synonymous text segments represent more expansion scenarios and transformation scenarios, the comprehensiveness of retrieval is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic view of an application environment of a document-based retrieval method of the present application;
FIG. 2 is a flow chart of calculating target scores of candidate documents according to the present application;
FIG. 3 is a flow chart illustrating a method for calculating a second score of a candidate document according to the present application;
FIG. 4 is a schematic block diagram of a preferred embodiment of a document-based search apparatus according to the present application;
FIG. 5 is a schematic diagram of an electronic device according to a preferred embodiment of the present application;
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the description of "first", "second", etc. in this disclosure is for descriptive purposes only and is not to be construed as indicating or implying a relative importance or implying an indication of the number of technical features being indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, but it is necessary to base that the technical solutions can be realized by those skilled in the art, and when the technical solutions are contradictory or cannot be realized, the combination of the technical solutions should be considered to be absent and not within the scope of protection claimed in the present application.
Referring to FIG. 1, a method flow diagram of an embodiment of a document-based retrieval method of the present application is shown. The method may be performed by an electronic device, which may be implemented in software and/or hardware, for example, the electronic device may be a data center server or a cloud server, or may be a terminal device. The document-based retrieval method comprises the following steps:
step S1: acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
step S2: filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
step S3: generating a synonymous text segment corresponding to each target text segment according to the word segments in each target text segment;
step S4: retrieving, from a preset database, a candidate document set associated with the target text segments and the synonymous text segments according to each target text segment and its corresponding synonymous text segment;
step S5: calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
In this embodiment, the user may upload the document to be searched in the search interface, or may input the document to be searched from the search interface, where the format of the document to be searched may be a word document, a notepad document, a PDF document, or the like.
After the document to be retrieved is obtained, a plurality of keywords of the document are extracted. The keywords can be identified by a pre-trained semantic analysis model, which can be obtained by training an HMM model: the document to be retrieved is input into the semantic analysis model, and the model outputs the semantic keywords of the document.
A maximum forward matching algorithm or a maximum reverse matching algorithm can also be used to perform word segmentation on the document to be retrieved. The number of occurrences of each word in the document is counted, the IDF (inverse document frequency) value of each word is calculated, and the TF (term frequency) value of each word in the document is calculated, where TF = (number of occurrences of the word in the document) / (sum of the numbers of occurrences of all words in the document). Multiplying the TF value by the IDF value gives the TF-IDF value of each word. The larger the TF-IDF value, the more important the word is to the document to be retrieved and the higher its priority as a keyword, so the words with the highest TF-IDF values may be used as keywords of the document; for example, the 10 words with the highest TF-IDF values may be selected.
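The TF-IDF ranking described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: the function name `top_keywords`, the already-tokenized inputs, and the smoothed IDF form log((1 + N) / (1 + df)) + 1 are all assumptions, since the text does not state which IDF variant is used.

```python
import math
from collections import Counter


def top_keywords(doc_tokens, corpus_tokens, k=10):
    """Return the k words of doc_tokens with the highest TF-IDF values.

    doc_tokens: list of words of the document to be retrieved.
    corpus_tokens: list of word lists, one per document in the corpus.
    """
    counts = Counter(doc_tokens)
    total = sum(counts.values())  # sum of occurrences of all words
    corpus_sets = [set(tokens) for tokens in corpus_tokens]
    n_docs = len(corpus_sets)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        df = sum(1 for words in corpus_sets if word in words)
        # Assumed smoothing; it keeps the IDF positive for every word.
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[word] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]


corpus = [
    ["image", "recognition", "feature", "extraction", "image"],
    ["speech", "recognition", "model"],
    ["text", "retrieval", "model"],
]
print(top_keywords(corpus[0], corpus, k=3))
```

In this toy corpus "recognition" also appears in a second document, so its IDF is lower and it ranks below the words that are specific to the first document.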
Compared with keywords, a text segment can represent the core content of the document to be retrieved more accurately. However, the abstract text segment of a document usually describes the core content only briefly and lacks expansion scenarios and transformation scenarios of that content. The extracted keywords are therefore combined with a preset template text segment set to generate target text segments that can represent more expansion scenarios and transformation scenarios. The preset template text segment set comprises a plurality of template text segments, which can be preconfigured according to actual scenario requirements. Filling the keywords into the template text segments yields target text segments for a plurality of scenarios of the document to be retrieved, including its expansion scenarios and transformation scenarios. Specifically, filling the keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into blank positions of the preset template text segments according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
The part of speech of each keyword, such as common noun, locative noun, common verb, adjective or pronoun, can be identified by a lexical analysis model. According to the part of speech of each keyword, the keywords of the document to be retrieved are filled into the blank positions of the preset template text segments to obtain a plurality of initial text segments, each of which is the result of filling keywords into the corresponding template text segment. Because an initial text segment may contain semantic logic errors or grammatical errors after filling, the initial text segments with such errors are deleted, and the remaining initial text segments are used as the target text segments.
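The filling step can be sketched as below. The slot syntax `[noun]` / `[verb]`, the function name `fill_templates` and the sample templates are hypothetical; the parts of speech are assumed to have been produced already by a lexical analysis model, and the later deletion of grammatically or logically incorrect segments is not shown.

```python
import re
from itertools import product

# Hypothetical slot syntax: a blank position is written as [part_of_speech].
SLOT = re.compile(r"\[(\w+)\]")


def fill_templates(templates, keywords_by_pos):
    """Fill every blank of every template with keywords of the matching
    part of speech and return all resulting initial text segments."""
    results = []
    for template in templates:
        slots = SLOT.findall(template)
        choices = [keywords_by_pos.get(pos, []) for pos in slots]
        for combo in product(*choices):
            if len(set(combo)) < len(combo):
                continue  # do not use the same keyword twice in one segment
            filled = iter(combo)
            results.append(SLOT.sub(lambda match: next(filled), template))
    return results


templates = ["a method of [noun] based on [noun]", "how to [verb] the [noun]"]
keywords = {"noun": ["image", "feature"], "verb": ["extract"]}
for segment in fill_templates(templates, keywords):
    print(segment)
```

Enumerating every combination deliberately over-generates; the filtering step described in the text is what would reduce the initial text segments to the target text segments.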
In order to further expand the retrieval range of the document to be retrieved, after the target text segments representing its expansion scenarios and transformation scenarios are obtained, a synonymous text segment corresponding to each target text segment is generated according to the word segments in that target text segment, thereby expanding the text segments associated with the target text segment. The synonymous text segments can be obtained by matching against a preset text segment library according to semantic similarity. In one embodiment, generating the synonymous text segment corresponding to each target text segment according to the word segments in each target text segment includes:
executing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially executing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model respectively to obtain a plurality of synonyms for the masked position in each masked text segment;
and executing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
Assuming that the target text segment is a single sentence, take "the key of image recognition is feature extraction" as an example. Performing the word segmentation operation on this sentence yields word segments including "image", "recognition", "key", "is", "feature" and "extraction".
A masking operation is performed on each word segment of the target text segment in turn to obtain a plurality of masked text segments corresponding to the target text segment. The masked text segments obtained include:
The key of [mask] recognition is feature extraction
The key of image [mask] is feature extraction
The key [mask] image recognition is feature extraction
The [mask] of image recognition is feature extraction
The key of image recognition [mask] feature extraction
The key of image recognition is [mask] extraction
The key of image recognition is feature [mask]
Each masked text segment is input into the BERT model to obtain a plurality of synonyms for the masked position of that segment. For example, the synonyms of "image" may include "picture" and "portrait". The word segments in the target text segment are replaced with the obtained synonyms to produce the synonymous text segments corresponding to the target text segment, for example "the key of picture recognition is feature extraction" and "the key of portrait recognition is feature extraction". Because the target text segments cover the expansion scenarios and transformation scenarios of the document to be retrieved, the corresponding synonymous text segments further expand the scenarios of the document to be retrieved.
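The masking and replacement operations can be sketched without the language model itself. In this illustration the function names `mask_each_token` and `substitute` are hypothetical, the sentence is tokenized by hand, and the candidate words "picture" and "portrait" are supplied directly in place of the predictions that a BERT model would return for the masked position.

```python
def mask_each_token(tokens, mask_token="[MASK]"):
    """Return one masked copy of the token list per position."""
    return [
        tokens[:i] + [mask_token] + tokens[i + 1:]
        for i in range(len(tokens))
    ]


def substitute(tokens, position, candidates):
    """Return one copy of the token list per candidate word, with the
    word at `position` replaced by that candidate."""
    return [
        tokens[:position] + [word] + tokens[position + 1:]
        for word in candidates
    ]


tokens = ["the", "key", "of", "image", "recognition", "is",
          "feature", "extraction"]

for masked in mask_each_token(tokens):
    print(" ".join(masked))

# Candidates assumed to come from the model for the masked word "image".
for variant in substitute(tokens, tokens.index("image"),
                          ["picture", "portrait"]):
    print(" ".join(variant))
```

A masked language model scores words by how well they fit the context, so in practice the predicted candidates may need an additional check that they are close in meaning to the original word.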
After the synonymous text segments corresponding to the target text segments are obtained, a candidate document set associated with the target text segments and the synonymous text segments is retrieved from a preset database, which can be a local database or a third-party database. Because the target text segments and the synonymous text segments both represent the expansion scenarios and transformation scenarios of the document to be retrieved, the candidate document set retrieved with them is more closely associated with the document to be retrieved. For example, the candidate document set can be retrieved from the database using the semantic information of the target text segments and the synonymous text segments, or using their keyword information. Specifically, retrieving, according to each target text segment and its corresponding synonymous text segment, the candidate document set associated with the target text segment and the synonymous text segment from a preset database includes:
retrieving a first document set associated with each target text segment from a preset database;
retrieving a second document set associated with each synonymous text segment from a preset database;
and merging the first document set and the second document set and deleting the repeated documents to obtain the candidate document set.
The documents associated with each target text segment are retrieved from the database and denoted as the first document set. The documents associated with each synonymous text segment are retrieved from the database and denoted as the second document set. Since the first document set and the second document set may contain the same documents, a deduplication operation is performed to obtain the candidate document set.
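The merge and deduplication step can be sketched as follows, assuming that each retrieved document is identified by a unique, hashable identifier; the function name `merge_candidates` and the identifiers are hypothetical.

```python
def merge_candidates(first_set, second_set):
    """Merge two retrieval results and drop repeated documents,
    keeping the order in which documents were first seen."""
    seen = set()
    merged = []
    for doc_id in list(first_set) + list(second_set):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


print(merge_candidates(["d1", "d2"], ["d2", "d3"]))
```

Preserving first-seen order is a design choice of this sketch; the final order of the retrieval result is determined later by the target scores.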
After the candidate document set is obtained, a target score is calculated for each candidate document in the candidate document set. The target score characterizes the relevance between the candidate document and the document to be retrieved: the higher the target score, the higher the relevance. The candidate documents are sorted according to their target scores, and the sorting result is taken as the retrieval result of the document to be retrieved. Referring to FIG. 2, which is a flow chart of calculating the target scores of candidate documents according to the present application, the flow of calculating a target score includes:
step S51: calculating the word set similarity between the key tag words of the candidate document and the keywords of the document to be retrieved, and taking the word set similarity as a first score of the candidate document;
step S52: calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the text segment similarity as a second score of the candidate document;
step S53: calculating a third score of the candidate document according to the search information of the candidate document;
step S54: calculating the target score of the candidate document according to the first score, the second score, the third score and preset weights.
The key tag words of the candidate documents can be obtained directly from the database. Each document in the database has corresponding key tag words, which can be used to classify the documents in the database to facilitate retrieval. The key tag words of each candidate document are taken as one word set and the keywords of the document to be retrieved as another word set; the word set similarity between the two is calculated and used as the first score of the candidate document.
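The text does not name the word set similarity measure. As one possible choice, the sketch below assumes the Jaccard coefficient (size of the intersection divided by size of the union); the function name `word_set_similarity` is hypothetical.

```python
def word_set_similarity(tag_words, keywords):
    """Jaccard similarity of two word collections (an assumed measure)."""
    a, b = set(tag_words), set(keywords)
    if not a and not b:
        return 0.0  # no words on either side: treat as no similarity
    return len(a & b) / len(a | b)


first_score = word_set_similarity(
    ["image", "recognition"],
    ["image", "feature", "recognition", "extraction"],
)
print(first_score)  # 2 shared words out of 4 distinct words
```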
The abstract text segment of a candidate document can also be obtained directly from the database. Each document in the database has a corresponding abstract text segment, from which a user can quickly learn the core content of the document. The text segment similarity between the abstract text segment of each candidate document and the target text segment of the document to be retrieved is calculated and used as the second score of the candidate document. Referring to FIG. 3, which is a flow chart of calculating the second score of a candidate document according to the present application, calculating the text segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved and taking it as the second score of the candidate document specifically includes:
step S521: performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain clauses of the abstract text segment and clauses of the target text segment;
step S522: converting the clauses of the abstract text segment into a first sentence vector set, and splicing the sentence vectors of the first sentence vector set to obtain a segment vector of the abstract text segment;
step S523: converting the clauses of the target text segment into a second sentence vector set, and splicing the sentence vectors of the second sentence vector set to obtain a segment vector of the target text segment;
step S524: calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment as the second score of the candidate document.
The abstract text segment and the target text segment are each split into clauses. The clauses of the abstract text segment are converted into a plurality of sentence vectors using a word2vec model; these are denoted as the first sentence vector set and spliced to obtain the segment vector of the abstract text segment. Likewise, the clauses of the target text segment are converted into a plurality of sentence vectors, denoted as the second sentence vector set, and spliced to obtain the segment vector of the target text segment. The similarity between the two segment vectors is calculated to obtain the second score of the candidate document.
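The splicing and comparison of segment vectors can be sketched as below. The sentence vectors are taken as given (for example, produced from a word2vec model). Cosine similarity and zero-padding of the shorter segment vector are assumptions of this sketch: the text specifies neither the similarity measure nor how two segments with different numbers of clauses are aligned. All function names are hypothetical.

```python
import math


def segment_vector(sentence_vectors):
    """Splice (concatenate) sentence vectors into one segment vector."""
    return [x for vector in sentence_vectors for x in vector]


def cosine_similarity(u, v):
    """Cosine similarity; the shorter vector is zero-padded (assumption)."""
    n = max(len(u), len(v))
    u = list(u) + [0.0] * (n - len(u))
    v = list(v) + [0.0] * (n - len(v))
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_similarity(summary_sentence_vectors, target_sentence_vectors):
    """Second score: similarity of the two spliced segment vectors."""
    return cosine_similarity(
        segment_vector(summary_sentence_vectors),
        segment_vector(target_sentence_vectors),
    )


summary = [[1.0, 0.0], [0.0, 1.0]]  # two clauses, toy 2-dimensional vectors
target = [[1.0, 0.0]]               # one clause
print(segment_similarity(summary, target))
```

Because splicing makes the result depend on clause order and count, a practical system might pool the sentence vectors instead; that alternative is outside what the text describes.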
Search information of each candidate document is acquired, for example the number of searches, the number of views, and the search date. A third score of each candidate document is calculated from its search information; specifically, the third score is calculated using a formula in which:
S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the most recent time at which the nth candidate document was searched, and D_0 represents the current time. The time may be measured in hours; for example, if the document was last searched at 10:09 and the current time is 17:30, the time difference is taken as 7 hours. In this way, documents recently searched by other users obtain a higher third score, while the third score of documents that are not searched decreases over time.
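The formula itself does not survive in this text, so the following is only a sketch under an explicit assumption: that the third score follows an exponential, Newton's-cooling-style decay, S(n) = T_0 * exp(-α * (D_0 - D_n)), with the time difference truncated to whole hours as in the 10:09 to 17:30 example. The functional form and the default values of `initial_weight` and `alpha` are hypothetical.

```python
import math
from datetime import datetime


def third_score(last_searched: datetime, now: datetime,
                initial_weight: float = 1.0, alpha: float = 0.1) -> float:
    """Assumed form: S(n) = T_0 * exp(-alpha * (D_0 - D_n)), in whole hours.

    last_searched plays the role of D_n and now the role of D_0; the
    exponential form is an assumption, not taken from the source.
    """
    elapsed_hours = int((now - last_searched).total_seconds() // 3600)
    elapsed_hours = max(elapsed_hours, 0)  # guard against clock skew
    return initial_weight * math.exp(-alpha * elapsed_hours)
```

Under this assumed form, a document searched within the last hour keeps the full initial weight, and the score shrinks smoothly the longer the document goes without being searched, which matches the behaviour the paragraph describes.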
The target score of the candidate document is calculated from the first score, the second score, the third score and the preset weights, where the weight of the first score is 0.4, the weight of the second score is 0.4, and the weight of the third score is 0.2.
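As a small illustration of the weighting, the sketch below combines the three scores with the 0.4 / 0.4 / 0.2 weights given in the text and sorts the candidates. The `Candidate` class and its field names are hypothetical, and sorting from highest to lowest target score is an assumption about the ranking direction.

```python
from dataclasses import dataclass

# Weights for the first, second and third scores, as stated in the text.
WEIGHTS = (0.4, 0.4, 0.2)


@dataclass
class Candidate:
    doc_id: str
    first: float   # keyword word-set similarity
    second: float  # abstract/target segment similarity
    third: float   # search-recency score


def target_score(c: Candidate, weights=WEIGHTS) -> float:
    """Weighted sum of the three scores."""
    w1, w2, w3 = weights
    return w1 * c.first + w2 * c.second + w3 * c.third


def rank(candidates):
    """Sort candidates by target score, highest first (assumed direction)."""
    return sorted(candidates, key=target_score, reverse=True)
```

For instance, a candidate with scores 0.5, 0.5 and 1.0 receives 0.4 * 0.5 + 0.4 * 0.5 + 0.2 * 1.0 = 0.6.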
Referring to fig. 4, a functional block diagram of a document-based retrieval apparatus 100 according to the present application is shown.
The document-based retrieval apparatus 100 of the present application may be installed in an electronic device. Depending on the functions implemented, the document-based retrieval apparatus 100 may include an acquisition module 110, a filling module 120, a generation module 130, a retrieval module 140, and a calculation module 150. A module of the present application, which may also be referred to as a unit, is a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the acquisition module 110: configured to acquire a document to be retrieved and extract a plurality of keywords of the document to be retrieved;
the filling module 120: configured to fill the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
the generation module 130: configured to generate a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
the retrieval module 140: configured to retrieve, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment;
the calculation module 150: configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
In one embodiment, the filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into the blank positions of the preset template text segment set according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
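The filling step can be sketched as below. The `{n}` / `{v}` slot syntax, the part-of-speech tags, and the `is_grammatical` callback are all hypothetical: the text only says that a lexical analysis model supplies the part of speech and that grammatically incorrect segments are deleted, without naming a template format or a grammar checker.

```python
import re
from itertools import product

# Hypothetical slot syntax: "{n}" is a blank for a noun, "{v}" for a verb.
SLOT = re.compile(r"\{(\w+)\}")


def fill_templates(templates, tagged_keywords, is_grammatical=lambda text: True):
    """Fill every blank with keywords of the matching part of speech.

    tagged_keywords: list of (keyword, part_of_speech) pairs.
    is_grammatical: hypothetical grammar check; accepts everything by default.
    """
    by_pos = {}
    for word, pos in tagged_keywords:
        by_pos.setdefault(pos, []).append(word)

    targets = []
    for template in templates:
        slots = SLOT.findall(template)
        choices = [by_pos.get(pos, []) for pos in slots]
        for combo in product(*choices):
            if len(set(combo)) < len(combo):
                continue  # do not reuse one keyword in two blanks
            words = iter(combo)
            text = SLOT.sub(lambda match: next(words), template)
            if is_grammatical(text):  # drop ungrammatical initial segments
                targets.append(text)
    return targets
```

A template whose blanks cannot all be filled (no keyword with the required part of speech) simply produces no initial text segment.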
In one embodiment, the generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment includes:
performing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially performing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and performing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
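The mask-and-replace loop can be illustrated as follows. The BERT call is deliberately left out: `predict_synonyms` is a hypothetical callable that would wrap a masked language model and return candidate words for the masked position, and the `[MASK]` token string and whitespace joining are assumptions made for the sketch.

```python
MASK = "[MASK]"


def masked_variants(tokens):
    """Mask each word segment in turn, yielding one masked segment per position."""
    return [tokens[:i] + [MASK] + tokens[i + 1:] for i in range(len(tokens))]


def synonymous_segments(tokens, predict_synonyms):
    """Build synonymous segments by replacing one word segment at a time.

    predict_synonyms(masked_tokens, position) is a hypothetical callable
    returning candidate words for the masked position.
    """
    segments = []
    for position, masked in enumerate(masked_variants(tokens)):
        for candidate in predict_synonyms(masked, position):
            if candidate == tokens[position]:
                continue  # the model proposed the original word; skip it
            replaced = tokens[:position] + [candidate] + tokens[position + 1:]
            segments.append(" ".join(replaced))
    return segments
```

In a test setting the model can be replaced by a lookup table, for example `lambda masked, i: {0: ["retrieve"], 2: ["files"]}.get(i, [])`, which turns `["search", "similar", "documents"]` into "retrieve similar documents" and "search similar files".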
In one embodiment, the retrieving, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment includes:
retrieving a first document set associated with each target text segment from the preset database;
retrieving a second document set associated with each synonymous text segment from the preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
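One reading of this deduplication step is that the two retrieved document sets are combined and any document found by both queries is kept only once. A minimal sketch under that reading, with documents represented by hypothetical string identifiers:

```python
def merge_candidates(first_set, second_set):
    """Combine two retrieved document lists, keeping each document once.

    Order of first appearance is preserved so the result is deterministic.
    """
    seen = set()
    merged = []
    for doc_id in list(first_set) + list(second_set):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged
```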
In one embodiment, the calculating the target score for each candidate document in the set of candidate documents includes:
calculating the word-set similarity between the keyword tag words of the candidate document and the keywords of the document to be retrieved, and taking the word-set similarity as a first score of the candidate document;
calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as a second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and the preset weight.
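The text does not name the word-set similarity measure used for the first score. One common choice is the Jaccard coefficient, the size of the intersection divided by the size of the union; the sketch below uses it purely as an assumption.

```python
def word_set_similarity(tag_words, keywords):
    """Jaccard similarity of two word collections (an assumed measure)."""
    a, b = set(tag_words), set(keywords)
    if not a and not b:
        return 0.0  # no words on either side: treat as no overlap
    return len(a & b) / len(a | b)
```

With tag words {retrieval, document, bert} and keywords {document, retrieval, template}, two words are shared out of four distinct words, giving a first score of 0.5.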
In one embodiment, the calculating the third score of the candidate document according to the search information of the candidate document includes calculating the third score using a formula in which:
S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the most recent time at which the nth candidate document was searched, and D_0 represents the current time.
In one embodiment, the calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document, includes:
performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and splicing the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and splicing the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment, and using it as the second score of the candidate document.
Referring to fig. 5, a schematic diagram of a preferred embodiment of an electronic device 1 according to the present application is shown.
The electronic device 1 includes, but is not limited to: a memory 11, a processor 12, a display 13, and a communication interface 14. The electronic device 1 may be connected to a network via the communication interface 14. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or a call network.
The memory 11 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card or the like provided on the electronic device 1. Of course, the memory 11 may also comprise both an internal storage unit of the electronic device 1 and an external storage device. In the present embodiment, the memory 11 is typically used for storing the operating system and the various types of computer programs installed in the electronic device 1, such as the program code of the document-based retrieval program 10. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 12 is typically used for controlling the overall operation of the electronic device 1, e.g. performing data interaction or communication related control and processing, etc. In this embodiment, the processor 12 is configured to execute program codes or process data stored in the memory 11, such as program codes or the like of the document-based retrieval program 10.
The display 13 may be referred to as a display screen or a display unit. The display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like in some embodiments. The display 13 is used for displaying information processed in the electronic device 1 and for displaying a visual work interface.
The communication interface 14 may optionally comprise a standard wired interface and a wireless interface (such as a Wi-Fi interface), and is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 5 shows only the electronic device 1 with components 11-14 and the document based retrieval program 10, but it should be understood that not all shown components are required to be implemented, and that more or fewer components may alternatively be implemented.
In the above-described embodiment, the processor 12 may implement the following steps when executing the document-based retrieval program 10 stored in the memory 11:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
retrieving, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment;
and calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, please refer to the above description of FIG. 4 for a functional block diagram of an embodiment of the document based retrieval device 100 and FIG. 1 for a flowchart of an embodiment of the document based retrieval method.
Furthermore, an embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes a data storage area and a program storage area; the program storage area stores a document-based retrieval program 10 which, when executed by a processor, performs the following operations:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
retrieving, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment;
and calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
The embodiment of the computer-readable storage medium of the present application is substantially the same as the embodiment of the document-based retrieval method described above, and is not described here again.
It should be noted that the serial numbers of the foregoing embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above embodiment method may be implemented by means of software plus a necessary general hardware simulation platform, or may be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, an electronic device, or a network device, etc.) to perform the method according to the embodiments of the present application.
The foregoing describes only preferred embodiments of the present application and is not intended to limit its scope; any equivalent structure or equivalent process derived from the contents disclosed herein, whether applied directly or indirectly in other related technical fields, is likewise covered.

Claims (10)

1. A document-based retrieval method, the method comprising:
acquiring a document to be retrieved, and extracting a plurality of keywords of the document to be retrieved;
filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
retrieving, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment;
and calculating a target score of each candidate document in the candidate document set, sorting the candidate documents according to the target scores, and taking the sorting result as the retrieval result of the document to be retrieved.
2. The document-based retrieval method according to claim 1, wherein the filling the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved includes:
identifying the part of speech of each keyword by using a lexical analysis model;
filling the keywords into the blank positions of the preset template text segment set according to the part of speech of each keyword to obtain a plurality of initial text segments;
deleting the grammatically incorrect text segments from the plurality of initial text segments to obtain the at least one target text segment.
3. The document-based retrieval method according to claim 1, wherein the generating a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment includes:
performing a word segmentation operation on the target text segment to obtain a plurality of word segments corresponding to the target text segment;
sequentially performing a masking operation on each word segment in the target text segment to obtain a plurality of masked text segments corresponding to the target text segment;
inputting each masked text segment into a BERT model to obtain a plurality of synonyms for the masked position in each masked text segment;
and performing a replacement operation on the word segments in the target text segment based on the synonyms to obtain the synonymous text segment corresponding to the target text segment.
4. The document-based retrieval method according to claim 1, wherein the retrieving, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment includes:
retrieving a first document set associated with each target text segment from the preset database;
retrieving a second document set associated with each synonymous text segment from the preset database;
and deleting the documents duplicated between the first document set and the second document set to obtain the candidate document set.
5. The document-based retrieval method of claim 1, wherein said calculating a target score for each candidate document in the set of candidate documents comprises:
calculating the word-set similarity between the keyword tag words of the candidate document and the keywords of the document to be retrieved, and taking the word-set similarity as a first score of the candidate document;
calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as a second score of the candidate document;
calculating a third score of the candidate document according to the search information of the candidate document;
and calculating the target score of the candidate document according to the first score, the second score, the third score and the preset weight.
6. The document-based retrieval method of claim 5, wherein calculating the third score of the candidate document according to the search information of the candidate document includes calculating the third score using a formula in which:
S(n) represents the third score of the nth candidate document, T_0 represents the initial search weight, α is a preset cooling coefficient, D_n represents the most recent time at which the nth candidate document was searched, and D_0 represents the current time.
7. The document-based retrieval method according to claim 5, wherein the calculating the segment similarity between the abstract text segment of the candidate document and the target text segment of the document to be retrieved, and taking the segment similarity as the second score of the candidate document, comprises:
performing a clause-splitting operation on the abstract text segment and the target text segment respectively to obtain the clauses of the abstract text segment and the clauses of the target text segment;
converting the clauses of the abstract text segment into a first sentence vector set, and splicing the sentence vectors of the first sentence vector set to obtain the segment vector of the abstract text segment;
converting the clauses of the target text segment into a second sentence vector set, and splicing the sentence vectors of the second sentence vector set to obtain the segment vector of the target text segment;
and calculating the similarity between the segment vector of the abstract text segment and the segment vector of the target text segment, and using it as the second score of the candidate document.
8. A document-based retrieval apparatus, the apparatus comprising:
an acquisition module, configured to acquire a document to be retrieved and extract a plurality of keywords of the document to be retrieved;
a filling module, configured to fill the plurality of keywords into a preset template text segment set to obtain at least one target text segment of the document to be retrieved;
a generation module, configured to generate a synonymous text segment corresponding to each target text segment according to the word segments in that target text segment;
a retrieval module, configured to retrieve, from a preset database and according to each target text segment and its corresponding synonymous text segment, a candidate document set associated with the target text segment and the synonymous text segment;
a calculation module, configured to calculate a target score of each candidate document in the candidate document set, sort the candidate documents according to the target scores, and take the sorting result as the retrieval result of the document to be retrieved.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the document-based retrieval method according to any one of claims 1 to 7 when executing the program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the document-based retrieval method according to any one of claims 1 to 7.
Application number: CN202310906497.2A
Priority date: 2023-07-24
Filing date: 2023-07-24
Title: Document-based retrieval method, device, equipment and storage medium
Status: Pending
Publication: CN116842138A (en), published 2023-10-03
Family ID: 88170685
Country: CN



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination