CN111813930A - Similar document retrieval method and device

Info

Publication number: CN111813930A
Application number: CN202010543812.6A
Authority: CN
Inventors: 毛红保
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2020-06-15
Filing date: 2020-06-15
Publication date: 2020-10-23
Anticipated expiration: 2040-06-15
Also published as: CN111813930B; WO2021253873A1

Abstract

Embodiments of the invention provide a method and a device for retrieving similar documents. The method comprises: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set. The method considers the results of the word frequency search and the document vectorization search together and combines them through the similarity, so that semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result obtained from a single model is avoided.

Description

Similar document retrieval method and device

Technical Field

The invention relates to the field of natural language analysis, in particular to a method and a device for retrieving similar documents.

Background

Document retrieval means that, given a document to be retrieved, the documents most similar to it in content are automatically retrieved from a massive document library. Document retrieval has a wide range of applications. In the translation field, for example, when a manuscript to be translated is received, documents with similar subject matter need to be retrieved from a historical manuscript library so that a suitable translator can be matched quickly, which improves translation quality and efficiency.

Traditional document retrieval mainly relies on keyword-based methods such as TF-IDF (term frequency-inverse document frequency). Such methods meet the requirements in most cases, but they ignore word order. For example, if a document contains a large number of phrases such as "machine learning", the search splits each phrase into the two keywords "machine" and "learning"; if every occurrence of "machine learning" in the document is replaced by "learning machine", the retrieval result does not change. To address this issue, deep-learning-based semantic representations of documents, such as the document vectorization model Doc2vec, have been applied to document retrieval. A document vectorization model is sensitive to word order and can better represent documents at the semantic level, but semantic inertia may arise in practice. For example, suppose the top 5 documents that best match "motorcycle production" need to be retrieved, and the document library contains a large number of documents related to "motorcycle sales" and "automobile production". If the semantic representation method is used, the top 5 documents are likely to all relate to "automobile production", because the semantic representation is more sensitive to the semantics of the document as a whole than to any particular keyword. The user, however, is likely to want the top 5 documents to cover both "automobile production" and "motorcycle sales". It can be seen that the retrieval results obtained with either current method alone are often limited, and accurate search results cannot be obtained.
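The word-order limitation described above can be illustrated with a minimal sketch. It assumes a plain bag-of-words term count, which is the kind of representation keyword methods such as TF-IDF build on; the helper name `bag_of_words` and the sample sentences are hypothetical.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count lowercased, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


original = "machine learning improves machine learning systems"
swapped = "learning machine improves learning machine systems"

# Both texts yield identical term counts, so a purely count-based
# score cannot tell "machine learning" from "learning machine".
same_counts = bag_of_words(original) == bag_of_words(swapped)
```

Because the two counters are equal, any score computed only from term counts treats the two texts as the same document.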

Disclosure of Invention

In order to solve the above problems, embodiments of the present invention provide a method and an apparatus for retrieving similar documents.

In a first aspect, an embodiment of the present invention provides a similar document retrieval method, including: searching based on a word frequency searching model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; overlapping the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain the candidate document set; and determining a retrieval result according to the candidate document set.

Further, the determining a search result according to the candidate document set includes: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.

Further, before superimposing the similarity of the same document in the first document set and the second document set, the method further includes: and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.

Further, the number of documents in the first document set, the second document set and the candidate document set is consistent.

Further, the word frequency searching model is a TF-IDF model.

Further, the document vectorization model is a Doc2vec model.

Further, the first preset ratio is 2/3, and the second preset ratio is 1/2.

In a second aspect, an embodiment of the present invention provides a similar document retrieval apparatus, including: the classification acquisition module is used for searching and obtaining a first document set and the similarity of each document based on the word frequency search model, and searching and obtaining a second document set and the similarity of each document based on the document vectorization model; the similarity superposition module is used for superposing the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set; and the retrieval result determining module is used for determining the retrieval result according to the candidate document set.

In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the similar document retrieval method according to the first aspect of the present invention.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the similar document retrieval method according to the first aspect of the present invention.

According to the similar document retrieval method and device provided by the embodiment of the invention, the similarity of the same documents in the first document set and the second document set is overlapped, a preset number of documents are selected according to the similarity from large to small to obtain a candidate document set, and the results of a word frequency search method and a document vectorization search method are considered and combined through the similarity, so that the semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of the retrieval result obtained by a single model is avoided.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a flowchart of a similar document retrieval method provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention;

FIG. 3 is a block diagram of a similar document retrieval apparatus according to an embodiment of the present invention;

FIG. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a similar document retrieval method provided in an embodiment of the present invention, and as shown in fig. 1, the embodiment of the present invention provides a similar document retrieval method, including:

101. Searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document.

The term frequency search model generally refers to a type of model that searches according to the term frequency of a keyword, such as a TF-IDF model. Document vectorization models generally refer to a class of models for keyword vector-based semantic retrieval, such as the Doc2vec model and the word2vec model.

In a specific implementation, keyword retrieval is performed for the document to be retrieved based on the word frequency search model; the resulting keyword retrieval result is the first document set, recorded as Result_TF-IDF. The document to be retrieved is also given a semantic vectorized representation and retrieved based on the document vectorization model; the resulting semantic retrieval result is the second document set, recorded as Result_Doc2vec. In addition to the retrieved documents, each search returns a similarity for every retrieved document, which represents how similar that document is to the document to be retrieved.

102. Superimposing the similarities of the same documents in the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set, recorded as Result_combination.

Where the same document appears in both the first document set and the second document set, its two similarities are superimposed; the similarities of all other documents in the two sets remain unchanged. All documents from the two sets are then sorted by similarity, and a preset number of them are selected as the candidate document set.
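Step 102 can be sketched as follows. This is a minimal illustration, assuming that "superimposing" means adding the two similarity values and that each document set is held as a mapping from a document identifier to its similarity; the function name `superimpose` and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def superimpose(
    first: Dict[str, float],
    second: Dict[str, float],
    preset_number: int,
) -> List[Tuple[str, float]]:
    """Add the similarities of shared documents, then keep the top documents."""
    combined = dict(first)
    for doc_id, score in second.items():
        # Shared documents get the sum; documents found by one model keep their score.
        combined[doc_id] = combined.get(doc_id, 0.0) + score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:preset_number]


result_tfidf = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
result_doc2vec = {"d2": 0.8, "d4": 0.7, "d5": 0.1}
result_combination = superimpose(result_tfidf, result_doc2vec, preset_number=3)
```

In this example "d2" is returned by both searches, so its summed similarity moves it ahead of "d1", which only the keyword search returned.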

As an alternative embodiment, the numbers of documents in the first document set, the second document set and the candidate document set are kept consistent. Specifically, the numbers may be the same or similar. For example, the first document set, the second document set and the candidate document set each contain N documents, which keeps the word frequency search and the document vectorization search balanced.

103. And determining a retrieval result according to the candidate document set.

The candidate document set takes both the word frequency search and the semantic search into account, so determining the final retrieval result from the candidate document set avoids the limitation of a retrieval result obtained from a single model. For example, a part of the candidate documents may be selected as the retrieval result, or the retrieval result may be further determined according to the candidate document set together with the first document set and the second document set.

According to the similar document retrieval method provided by the embodiment of the invention, the similarity of the same documents in the first document set and the second document set is overlapped, a preset number of documents are selected according to the similarity from large to small to obtain a candidate document set, and the results of the word frequency search method and the document vectorization search method are considered and combined through the similarity, so that the semantic inertia is eliminated to a certain extent, the multi-dimensional retrieval result is obtained, and the limitation of the retrieval result obtained by a single model is avoided.

Based on the content of the foregoing embodiment, as an alternative embodiment, determining a search result according to a candidate document set includes: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.

Fig. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention. As shown in fig. 2, documents of a first preset proportion are selected from the second document set in descending order of similarity. For example, the first document set, the second document set and the candidate document set each contain 3N documents; with a first preset proportion of 2/3, a third document set of 2N documents is selected. For each document in the third document set, if the document also exists in the candidate document set, its similarity value in the third document set is updated with the similarity value from the candidate document set; the similarity values of the other documents in the third document set remain unchanged. The updated third document set is then re-sorted by similarity, and documents of the second preset proportion are selected as the retrieval result. For example, with a second preset proportion of 1/2, the top N documents in descending order of similarity are selected and recorded as Result_merge, the final retrieval result.
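The re-ranking step of this embodiment can be sketched as below. This is an illustrative reading only: it assumes document sets are mappings from identifier to similarity, that set sizes are rounded down when a proportion does not divide evenly, and it uses hypothetical names (`rerank_semantic_results`) and made-up scores with 3N = 6.

```python
from fractions import Fraction
from typing import Dict, List, Tuple


def rerank_semantic_results(
    second: Dict[str, float],
    candidates: Dict[str, float],
    first_ratio: Fraction = Fraction(2, 3),
    second_ratio: Fraction = Fraction(1, 2),
) -> List[Tuple[str, float]]:
    """Adjust the top semantic results using the candidate-set similarities."""
    # Third document set: the top share of the semantic retrieval result.
    ranked_second = sorted(second.items(), key=lambda item: item[1], reverse=True)
    third = dict(ranked_second[: int(len(ranked_second) * first_ratio)])

    # Documents also present in the candidate set take the candidate similarity.
    for doc_id in third:
        if doc_id in candidates:
            third[doc_id] = candidates[doc_id]

    # Re-sort and keep the second preset proportion as the final result.
    reranked = sorted(third.items(), key=lambda item: item[1], reverse=True)
    return reranked[: int(len(reranked) * second_ratio)]


result_doc2vec = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
result_combination = {"c": 1.5, "d": 1.2, "x": 1.0, "a": 0.9, "y": 0.85, "b": 0.8}
result_merge = rerank_semantic_results(result_doc2vec, result_combination)
```

Here the third document set is {a, b, c, d}; "c" and "d" receive higher superimposed similarities from the candidate set and therefore overtake "a" and "b" in the final result of N = 2 documents.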

The similar document retrieval method of the embodiment of the invention mainly takes the semantic retrieval result of the document vectorization model, and adjusts the semantic retrieval result by keyword retrieval, thereby eliminating semantic inertia to a certain extent, obtaining a multi-dimensional retrieval result and ensuring the accuracy of the retrieval result.

Based on the content of the foregoing embodiment, as an optional embodiment, before superimposing the similarity of the same document in the first document set and the second document set, the method further includes: and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.

The document similarities in the first document set, obtained by keyword retrieval, and the document similarities in the second document set, obtained by semantic retrieval, are normalized separately, and the similarities of the documents present in both sets are then superimposed. Normalizing the similarities of the two sets separately avoids the influence of a scale imbalance between the similarities of the first document set and those of the second document set.
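The text does not state which normalization is used. The sketch below assumes min-max scaling to the range [0, 1], applied to each set separately, purely as one plausible choice; the function name `min_max_normalize` and the sample scores are hypothetical.

```python
from typing import Dict


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one set's similarities to [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All similarities are equal; map them to 1.0 to avoid dividing by zero.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (score - low) / (high - low) for doc_id, score in scores.items()}


# The two models may report similarities on very different scales.
tfidf_scores = {"d1": 12.0, "d2": 6.0, "d3": 3.0}
doc2vec_scores = {"d2": 0.9, "d4": 0.5, "d5": 0.1}

norm_tfidf = min_max_normalize(tfidf_scores)
norm_doc2vec = min_max_normalize(doc2vec_scores)
```

After scaling, neither model dominates the superimposed similarity merely because its raw scores are numerically larger.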

Based on the content of the above embodiments, as an alternative embodiment, the word frequency search model is a TF-IDF model.

TF-IDF is a commonly used weighting method in information retrieval and data mining. TF stands for Term Frequency and IDF stands for Inverse Document Frequency.

A TF-IDF model of the document library is trained using the gensim tool for the Python language. Based on this model, the document to be retrieved is given a keyword vectorized representation and retrieved, yielding the keyword retrieval result of the document to be retrieved.
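To make the keyword retrieval step concrete, here is a small standard-library sketch. It does not use the tool named above and is not the patent's implementation: it assumes one common weighting variant, tf × ln(N / df), and cosine similarity for ranking. All names and the three-document library are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: Dict[str, List[str]]) -> Tuple[Dict[str, Dict[str, float]], Dict[str, float]]:
    """Build sparse TF-IDF vectors using the assumed weighting tf * ln(N / df)."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


library = {
    "doc_a": "motorcycle production line report".split(),
    "doc_b": "automobile production statistics".split(),
    "doc_c": "motorcycle sales forecast".split(),
}
vectors, idf = tfidf_vectors(library)

query_tokens = "motorcycle production".split()
query_vector = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}

# Rank the library by similarity to the query, highest first.
result_tfidf = sorted(
    ((doc_id, cosine(query_vector, vec)) for doc_id, vec in vectors.items()),
    key=lambda item: item[1],
    reverse=True,
)
```

The document that contains both query keywords ranks first; the two documents that share only one keyword with the query receive lower scores.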

Based on the content of the above embodiments, as an alternative embodiment, the document vectorization model is a Doc2vec model. Doc2vec is an unsupervised algorithm that learns vector representations of texts and is an extension of word2vec. The learned vectors can be used to measure the similarity between texts by computing a distance, can be used for text clustering, and, for labeled data, can be used for text classification with a supervised learning method, as in the classic sentiment analysis problem.
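Once documents are represented as vectors, ranking by similarity reduces to a distance computation. The sketch below assumes cosine similarity as the measure and uses made-up three-dimensional vectors as stand-ins for learned document vectors; it does not train or call a Doc2vec model, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(
    query: Sequence[float],
    doc_vectors: Dict[str, Sequence[float]],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Return the top_n documents ranked by cosine similarity to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Placeholder vectors standing in for learned document embeddings.
doc_vectors = {"d1": [1.0, 0.0, 0.0], "d2": [0.6, 0.8, 0.0], "d3": [0.0, 0.0, 1.0]}
query_vector = [1.0, 0.0, 0.0]
result_doc2vec = most_similar(query_vector, doc_vectors, top_n=2)
```

The returned list pairs each document identifier with its similarity, which is the form the second document set needs before its similarities are combined with those of the keyword search.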

A Doc2vec model of the document library is trained using the gensim tool for the Python language. Based on this model, the document to be retrieved is given a semantic vectorized representation and retrieved, yielding the semantic retrieval result of the document to be retrieved.

Based on the content of the above embodiments, as an alternative embodiment, the first preset ratio is 2/3, and the second preset ratio is 1/2. The above embodiments have been illustrated and will not be described herein.

Fig. 3 is a structural diagram of a similar document retrieval apparatus according to an embodiment of the present invention, and as shown in fig. 3, the similar document retrieval apparatus includes: a classification acquisition module 301, a similarity superposition module 302 and a retrieval result determination module 303. The classification obtaining module 301 is configured to obtain a first document set and a similarity of each document based on a word frequency search model search, and obtain a second document set and a similarity of each document based on a document vectorization model search; the similarity overlapping module 302 is configured to overlap similarities of the same documents in the first document set and the second document set, and select a preset number of documents from large to small according to the similarities to obtain a candidate document set; the search result determining module 303 is configured to determine a search result according to the candidate document set.

Based on the content of the foregoing embodiment, as an optional embodiment, the retrieval result determining module 303 is specifically configured to: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.

The device embodiment provided in the embodiments of the present invention implements the above method embodiments; for the specific process and details, reference is made to the above method embodiments, which are not repeated here.

The similar document retrieval device provided by the embodiment of the invention superposes the similarity of the same documents in the first document set and the second document set, selects a preset number of documents according to the similarity from large to small to obtain a candidate document set, and simultaneously considers the results of the word frequency search method and the document vectorization search method and combines the similarity, thereby eliminating semantic inertia to a certain extent, obtaining a multi-dimensional retrieval result and avoiding the limitation of the retrieval result obtained by a single model.

Fig. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 4, the electronic device may include: a processor 401, a communication interface 402, a memory 403 and a bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with each other through the bus 404. The communication interface 402 may be used for information transfer of the electronic device. The processor 401 may call logic instructions in the memory 403 to perform a method comprising: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of the same documents in the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.

In addition, the logic instructions in the memory 403 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-described method embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the similar document retrieval method provided in the foregoing embodiments. For example, the method includes: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of the same documents in the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for retrieving similar documents, comprising:

searching based on a word frequency searching model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document;

overlapping the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set;

and determining a retrieval result according to the candidate document set.

2. The method for retrieving similar documents according to claim 1, wherein said determining a retrieval result according to said candidate document set comprises:

selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small;

and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.

3. The similar document retrieval method according to any one of claims 1-2, wherein before superimposing the similarity of the same document in the first document set and the second document set, further comprising:

and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.

4. The method for retrieving similar documents according to any of claims 1-2, wherein the number of documents in said first set of documents, said second set of documents and said candidate set of documents is kept consistent.

5. The method for retrieving similar documents according to any of claims 1-2, wherein said word frequency search model is a TF-IDF model.

6. The method for retrieving similar documents according to any of the claims 1-2, wherein said document vectorization model is a Doc2vec model.

7. The method of claim 2, wherein the first predetermined ratio is 2/3, and the second predetermined ratio is 1/2.

8. A similar document retrieval apparatus, comprising:

the classification acquisition module is used for searching and obtaining a first document set and the similarity of each document based on the word frequency search model, and searching and obtaining a second document set and the similarity of each document based on the document vectorization model;

the similarity superposition module is used for superposing the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set;

and the retrieval result determining module is used for determining the retrieval result according to the candidate document set.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the similar document retrieval method according to any of claims 1 to 7 are implemented when the program is executed by the processor.

10. A non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing the steps of the similar document retrieval method according to any one of claims 1 to 7.