CN111813930A - Similar document retrieval method and device - Google Patents


Publication number
CN111813930A
Authority
CN
China
Prior art keywords
document
similarity
document set
documents
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010543812.6A
Other languages
Chinese (zh)
Other versions
CN111813930B (en)
Inventor
毛红保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202010543812.6A priority Critical patent/CN111813930B/en
Publication of CN111813930A publication Critical patent/CN111813930A/en
Priority to PCT/CN2021/078813 priority patent/WO2021253873A1/en
Application granted granted Critical
Publication of CN111813930B publication Critical patent/CN111813930B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention provide a similar document retrieval method and device. The method comprises: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set. Because the method considers the results of both the word frequency search and the document vectorization search and combines them through the similarity, semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of a retrieval result produced by a single model is avoided.

Description

Similar document retrieval method and device
Technical Field
The invention relates to the field of natural language analysis, in particular to a method and a device for retrieving similar documents.
Background
Document retrieval means that, given a document to be retrieved, the documents most similar to it in content are automatically retrieved from a massive document library. Document retrieval has a wide range of applications. In the translation field, for example, when a manuscript to be translated is received, documents with similar subject matter need to be retrieved from a historical manuscript library so that a suitable translator can be matched quickly, thereby improving translation quality and efficiency.
Traditional document retrieval mainly uses keyword-based methods such as TF-IDF (term frequency-inverse document frequency). These methods meet the requirements in most cases, but they ignore word order. For example, if a document contains many occurrences of the phrase "machine learning", the phrase is split into the two keywords "machine" and "learning" for searching; if every "machine learning" in the document were replaced by "learning machine", the retrieval result would not change. To address this problem, deep-learning-based semantic representations of documents, such as the document vectorization model Doc2vec, have been applied to document retrieval. A document vectorization model is sensitive to word order and represents documents better at the semantic level, but semantic inertia may appear in practice. For example, suppose the 5 documents that best match "motorcycle production" are to be retrieved, and the document library contains a large number of documents about "motorcycle sales" and "automobile production". If the semantic representation method is used, the top 5 documents are likely to all relate to "automobile production", because the semantic representation method is more sensitive to the global semantics of a document than to any particular keyword. The user, however, is likely to want the top 5 documents to cover both "automobile production" and "motorcycle sales". The retrieval results obtained with current methods are therefore often limited, and accurate search results cannot be obtained.
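The word-order limitation described above can be illustrated with a minimal sketch. The function name `bag_of_words` and the sample sentences are illustrative assumptions and do not come from the patent; the point is only that a pure keyword-count representation cannot distinguish "machine learning" from "learning machine".

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, discarding word order."""
    return Counter(text.lower().split())


# Two texts with the same words in a different order (made-up examples).
original = bag_of_words("machine learning improves machine translation")
shuffled = bag_of_words("learning machine improves translation machine")

# Both texts map to the same keyword counts, so any method that relies
# only on these counts scores them identically.
print(original == shuffled)  # True
```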
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a method and an apparatus for retrieving similar documents.
In a first aspect, an embodiment of the present invention provides a similar document retrieval method, including: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
Further, the determining a search result according to the candidate document set includes: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.
Further, before superimposing the similarity of the same document in the first document set and the second document set, the method further includes: and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.
Further, the number of documents in the first document set, the second document set and the candidate document set is consistent.
Further, the word frequency searching model is a TF-IDF model.
Further, the document vectorization model is a Doc2vec model.
Further, the first preset ratio is 2/3, and the second preset ratio is 1/2.
In a second aspect, an embodiment of the present invention provides a similar document retrieval apparatus, including: the classification acquisition module is used for searching and obtaining a first document set and the similarity of each document based on the word frequency search model, and searching and obtaining a second document set and the similarity of each document based on the document vectorization model; the similarity superposition module is used for superposing the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set; and the retrieval result determining module is used for determining the retrieval result according to the candidate document set.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the similar document retrieval method according to the first aspect of the present invention.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the similar document retrieval method according to the first aspect of the present invention.
According to the similar document retrieval method and device provided by the embodiment of the invention, the similarity of the same documents in the first document set and the second document set is overlapped, a preset number of documents are selected according to the similarity from large to small to obtain a candidate document set, and the results of a word frequency search method and a document vectorization search method are considered and combined through the similarity, so that the semantic inertia is eliminated to a certain extent, a multi-dimensional retrieval result is obtained, and the limitation of the retrieval result obtained by a single model is avoided.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flowchart of a similar document retrieval method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention;
FIG. 3 is a block diagram of a similar document retrieval apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a similar document retrieval method provided in an embodiment of the present invention, and as shown in fig. 1, the embodiment of the present invention provides a similar document retrieval method, including:
101. Searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document.
The term frequency search model generally refers to a type of model that searches according to the term frequency of a keyword, such as a TF-IDF model. Document vectorization models generally refer to a class of models for keyword vector-based semantic retrieval, such as the Doc2vec model and the word2vec model.
In a specific implementation, keyword retrieval is performed for the document to be retrieved based on the word frequency search model, and the keyword retrieval result is obtained as the first document set, recorded as Result_TF-IDF. The document to be retrieved is also given a semantic vectorized representation and retrieved based on the document vectorization model, and the semantic retrieval result is obtained as the second document set, recorded as Result_Doc2vec. In addition to the retrieved documents, the similarity of each retrieved document is obtained; it represents how similar the retrieved document is to the document to be retrieved.
102. Superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set, recorded as Result_combination.
Where the same document exists in both the first document set and the second document set, its two similarities are superimposed (added together); the similarities of the other documents in the two sets remain unchanged. All documents are then sorted by similarity, and a preset number of them are selected as the candidate document set.
As an alternative embodiment, the number of documents in the first set of documents, the second set of documents, and the candidate set of documents remain consistent. Specifically, the values may be the same or similar. For example, the number of documents in the first document set, the second document set and the candidate document set is N, so that the balance of word frequency search and vectorization search based on documents is ensured.
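Step 102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each result set is assumed to be a dictionary mapping a document identifier to its similarity, "superimposing" is taken to mean addition, and the name `superimpose`, the identifiers and the scores are all made up.

```python
def superimpose(first, second, k):
    """Add the scores of documents found in both result sets, keep the
    other scores unchanged, and return the k highest-scoring documents."""
    combined = dict(first)
    for doc_id, score in second.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + score
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Hypothetical results of the two searches (N = 3 documents each).
result_tfidf = {"d1": 0.9, "d2": 0.5, "d3": 0.25}
result_doc2vec = {"d2": 0.75, "d4": 0.5, "d5": 0.125}

result_combination = superimpose(result_tfidf, result_doc2vec, k=3)
print(result_combination)  # {'d2': 1.25, 'd1': 0.9, 'd4': 0.5}
```

Document d2 appears in both sets, so its combined score (1.25) lifts it above d1, which only the keyword search returned.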
103. And determining a retrieval result according to the candidate document set.
The candidate document set reflects both the word frequency search and the semantic search, so determining the final retrieval result from the candidate document set avoids the limitation of a retrieval result obtained by a single model. For example, a part of the candidate documents may be selected as the retrieval result, or the retrieval result may be further determined according to the candidate document set, the first document set and the second document set.
According to the similar document retrieval method provided by the embodiment of the invention, the similarity of the same documents in the first document set and the second document set is overlapped, a preset number of documents are selected according to the similarity from large to small to obtain a candidate document set, and the results of the word frequency search method and the document vectorization search method are considered and combined through the similarity, so that the semantic inertia is eliminated to a certain extent, the multi-dimensional retrieval result is obtained, and the limitation of the retrieval result obtained by a single model is avoided.
Based on the content of the foregoing embodiment, as an alternative embodiment, determining a search result according to a candidate document set includes: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.
Fig. 2 is a flowchart of a similar document retrieval method according to another embodiment of the present invention. As shown in Fig. 2, documents in a first preset proportion are selected from the second document set in descending order of similarity. For example, suppose the first document set, the second document set and the candidate document set each contain 3N documents. With a first preset proportion of 2/3, the selected third document set contains 2N documents. For each document in the third document set, if the document also exists in the candidate document set, its similarity in the third document set is updated with its similarity in the candidate document set; the similarities of the other documents in the third document set remain unchanged. The updated third document set is then re-sorted by similarity, and documents in the second preset proportion are selected as the retrieval result. For example, with a second preset proportion of 1/2, the top N documents in descending order of similarity are selected and recorded as Result_merge, the final retrieval result.
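The selection, update and re-sorting flow of Fig. 2 can be sketched as below. The function name `rerank`, the use of `round` to turn a proportion into a document count, and all identifiers and scores are assumptions made for illustration; the patent does not specify them.

```python
def rerank(second, candidates, first_ratio=2 / 3, second_ratio=1 / 2):
    """Keep the top share of the semantic results, overwrite their scores
    with the combined scores where available, then keep the top share."""
    ranked = sorted(second.items(), key=lambda item: item[1], reverse=True)
    third = dict(ranked[: round(len(ranked) * first_ratio)])
    for doc_id in third:
        if doc_id in candidates:
            third[doc_id] = candidates[doc_id]  # updated similarity
    reranked = sorted(third.items(), key=lambda item: item[1], reverse=True)
    return reranked[: round(len(reranked) * second_ratio)]


# Hypothetical inputs with 3N = 6 documents per set (N = 2).
result_doc2vec = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
result_combination = {"c": 1.5, "d": 1.3, "x": 1.0, "a": 0.9, "y": 0.8, "b": 0.8}

result_merge = rerank(result_doc2vec, result_combination)
print(result_merge)  # [('c', 1.5), ('d', 1.3)]
```

In this made-up example, documents c and d rank third and fourth in the semantic search alone, but their keyword matches raise their combined scores, so they become the final top N.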
The similar document retrieval method of the embodiment of the invention mainly takes the semantic retrieval result of the document vectorization model, and adjusts the semantic retrieval result by keyword retrieval, thereby eliminating semantic inertia to a certain extent, obtaining a multi-dimensional retrieval result and ensuring the accuracy of the retrieval result.
Based on the content of the foregoing embodiment, as an optional embodiment, before superimposing the similarity of the same document in the first document set and the second document set, the method further includes: and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.
The document similarities in the first document set, obtained by keyword retrieval, and those in the second document set, obtained by semantic retrieval, are normalized separately, and the similarities of documents present in both sets are then superimposed. Normalizing the two sets separately avoids the influence of an imbalance between the similarity scales of the first document set and the second document set.
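The text does not say which normalization is used. The sketch below assumes min-max scaling to the range [0, 1], applied to each result set on its own; the function name `min_max_normalize` and the sample scores are illustrative only.

```python
def min_max_normalize(scores):
    """Rescale the similarities of one result set to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # All scores are equal; treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


# Made-up raw scores on an arbitrary scale.
print(min_max_normalize({"a": 2.0, "b": 4.0, "c": 6.0}))
# {'a': 0.0, 'b': 0.5, 'c': 1.0}
```

After such scaling, neither search method dominates the sum merely because its raw scores happen to lie on a larger numeric scale.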
Based on the content of the above embodiments, as an alternative embodiment, the word frequency search model is a TF-IDF model.
TF-IDF is a commonly used weighting technique in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.
A TF-IDF model of the document library is trained using the Python-based gensim tool. Based on this model, the document to be retrieved is represented as a keyword vector and retrieved against the library, yielding the keyword retrieval result of the document to be retrieved.
Based on the content of the above embodiments, as an alternative embodiment, the document vectorization model is a Doc2vec model. Doc2vec is an unsupervised algorithm that produces vector representations of text and is an extension of word2vec. The learned vectors can be used to find the similarity between texts by calculating a distance, can be used for text clustering, and, for labeled data, can be used with supervised learning for text classification, such as the classic sentiment analysis problem.
A Doc2vec model of the document library is trained using the Python-based gensim tool. Based on this model, the document to be retrieved is given a semantic vectorized representation and retrieved against the library, yielding the semantic retrieval result of the document to be retrieved.
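To make the keyword retrieval step concrete, the following is a dependency-free sketch of TF-IDF retrieval with cosine similarity. It does not use any particular toolkit, and the weighting formula (relative term frequency multiplied by log(n/df) + 1), the function names and the toy library are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document and the IDF table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [
        {t: c / len(doc) * idf[t] for t, c in Counter(doc).items()} for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def search(query, docs, top_n):
    """Rank library documents by TF-IDF cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c / len(query) * idf[t] for t, c in Counter(query).items() if t in idf}
    scored = [(i, cosine(q, vec)) for i, vec in enumerate(vectors)]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Toy library of tokenized documents (made up).
library = [
    ["motorcycle", "production", "line"],
    ["motorcycle", "sales", "report"],
    ["automobile", "production", "plan"],
]
print(search(["motorcycle", "production"], library, top_n=2))
```

The first library document contains both query keywords, so it receives the highest score; the returned pairs of index and similarity play the role of the first document set.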
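Once every document has a vector, semantic retrieval reduces to a nearest-neighbor search over those vectors. The sketch below assumes cosine similarity as the measure and uses hand-written two-dimensional vectors in place of vectors learned by a real model; the function names and data are illustrative only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, doc_vecs, top_n):
    """Return the top_n (doc_id, similarity) pairs in descending order."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]


# Placeholder vectors; a trained model would supply these in practice.
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
second_set = most_similar([2.0, 0.0], doc_vecs, top_n=2)
print([doc_id for doc_id, _ in second_set])  # ['d1', 'd3']
```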
Based on the content of the above embodiments, as an alternative embodiment, the first preset ratio is 2/3, and the second preset ratio is 1/2. The above embodiments have been illustrated and will not be described herein.
Fig. 3 is a structural diagram of a similar document retrieval apparatus according to an embodiment of the present invention, and as shown in fig. 3, the similar document retrieval apparatus includes: a classification acquisition module 301, a similarity superposition module 302 and a retrieval result determination module 303. The classification obtaining module 301 is configured to obtain a first document set and a similarity of each document based on a word frequency search model search, and obtain a second document set and a similarity of each document based on a document vectorization model search; the similarity overlapping module 302 is configured to overlap similarities of the same documents in the first document set and the second document set, and select a preset number of documents from large to small according to the similarities to obtain a candidate document set; the search result determining module 303 is configured to determine a search result according to the candidate document set.
Based on the content of the foregoing embodiment, as an optional embodiment, the retrieval result determining module 303 is specifically configured to: selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small; and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.
The device embodiment provided in the embodiments of the present invention implements the above method embodiments; for the specific process and details, reference is made to the above method embodiments, which are not repeated here.
The similar document retrieval device provided by the embodiment of the invention superposes the similarity of the same documents in the first document set and the second document set, selects a preset number of documents according to the similarity from large to small to obtain a candidate document set, and simultaneously considers the results of the word frequency search method and the document vectorization search method and combines the similarity, thereby eliminating semantic inertia to a certain extent, obtaining a multi-dimensional retrieval result and avoiding the limitation of the retrieval result obtained by a single model.
Fig. 4 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 4, the electronic device may include: a processor 401, a communication interface 402, a memory 403 and a bus 404, wherein the processor 401, the communication interface 402 and the memory 403 communicate with each other through the bus 404. The communication interface 402 may be used for information transfer of the electronic device. The processor 401 may call logic instructions in the memory 403 to perform a method comprising: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
In addition, the logic instructions in the memory 403 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the above-described method embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the similar document retrieval method provided in the foregoing embodiments, the method including, for example: searching based on a word frequency search model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document; superimposing the similarities of documents that appear in both the first document set and the second document set, and selecting a preset number of documents in descending order of similarity to obtain a candidate document set; and determining a retrieval result according to the candidate document set.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for retrieving similar documents, comprising:
searching based on a word frequency searching model to obtain a first document set and the similarity of each document, and searching based on a document vectorization model to obtain a second document set and the similarity of each document;
overlapping the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set;
and determining a retrieval result according to the candidate document set.
2. The method for retrieving similar documents according to claim 1, wherein said determining a retrieval result according to said candidate document set comprises:
selecting documents with a first preset proportion from the second document set as a third document set according to the similarity from large to small;
and updating the similarity of the same documents in the third document set by using the similarity in the candidate document set, and selecting the documents with a second preset proportion from the third document set as a retrieval result according to the similarity.
3. The similar document retrieval method according to any one of claims 1-2, wherein before superimposing the similarity of the same document in the first document set and the second document set, further comprising:
and respectively carrying out normalization processing on the document similarity in the first document set and the second document set.
4. The method for retrieving similar documents according to any of claims 1-2, wherein the number of documents in said first set of documents, said second set of documents and said candidate set of documents is kept consistent.
5. The method for retrieving similar documents according to any of claims 1-2, wherein said word frequency search model is a TF-IDF model.
6. The method for retrieving similar documents according to any of the claims 1-2, wherein said document vectorization model is a Doc2vec model.
7. The method of claim 2, wherein the first predetermined ratio is 2/3, and the second predetermined ratio is 1/2.
8. A similar document retrieval apparatus, comprising:
the classification acquisition module is used for searching and obtaining a first document set and the similarity of each document based on the word frequency search model, and searching and obtaining a second document set and the similarity of each document based on the document vectorization model;
the similarity superposition module is used for superposing the similarity of the same documents in the first document set and the second document set, and selecting a preset number of documents according to the similarity from large to small to obtain a candidate document set;
and the retrieval result determining module is used for determining the retrieval result according to the candidate document set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the similar document retrieval method according to any of claims 1 to 7 are implemented when the program is executed by the processor.
10. A non-transitory computer readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the similar document retrieval method according to any one of claims 1 to 7.
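
The normalization and superposition steps recited in claims 3 and 8 can be illustrated with a minimal Python sketch. It is not the patented implementation. The claims only say "normalization", so min-max scaling is an assumption here, as is keeping a document that appears in only one set with its single normalized score. The names `min_max_normalize`, `build_candidates`, `tfidf_hits` and `doc2vec_hits` are hypothetical, and the similarity values are made-up inputs standing in for the output of a TF-IDF model and a Doc2vec model.

```python
from typing import Dict, List, Tuple

# Maps a document id to its similarity with the query document.
Scores = Dict[str, float]


def min_max_normalize(scores: Scores) -> Scores:
    """Rescale similarities to [0, 1] (assumed normalization scheme)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; map them to 1.0 to avoid dividing by zero.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def build_candidates(first: Scores, second: Scores, top_n: int) -> List[Tuple[str, float]]:
    """Normalize both result sets, add the scores of documents found in
    both, and return the top_n documents in descending order of score."""
    fused = dict(min_max_normalize(first))
    for doc, s in min_max_normalize(second).items():
        # Assumption: a document present in only one set keeps that one score.
        fused[doc] = fused.get(doc, 0.0) + s
    # Sort by score (largest first); ties are broken by document id.
    ranked = sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical similarities from the word frequency model and the
    # document vectorization model; both sets have the same size (claim 4).
    tfidf_hits = {"d1": 0.82, "d2": 0.40, "d3": 0.10}
    doc2vec_hits = {"d2": 0.95, "d3": 0.55, "d4": 0.15}
    for doc, score in build_candidates(tfidf_hits, doc2vec_hits, top_n=3):
        print(doc, round(score, 3))
```

With these inputs the candidate set is ranked d2, d1, d3: d2 scores well under both models, so its summed similarity overtakes d1, which only the word frequency model retrieved. Normalizing first keeps either model from dominating the sum merely because its raw scores lie on a larger scale.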

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010543812.6A CN111813930B (en) 2020-06-15 2020-06-15 Similar document retrieval method and device
PCT/CN2021/078813 WO2021253873A1 (en) 2020-06-15 2021-03-03 Method and apparatus for retrieving similar document

Publications (2)

Publication Number Publication Date
CN111813930A (en) 2020-10-23
CN111813930B (en) 2024-02-20

Family

ID=72845178

Family Applications (1)

Application Number Legal Status Publication Number Priority Date Filing Date Title
CN202010543812.6A Active CN111813930B (en) 2020-06-15 2020-06-15 Similar document retrieval method and device

Country Status (2)

Country Link
CN (1) CN111813930B (en)
WO (1) WO2021253873A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114912431A (en) * 2022-06-01 2022-08-16 北京金山数字娱乐科技有限公司 Document searching method and device
CN114780690B (en) * 2022-06-20 2022-09-09 成都信息工程大学 Patent text retrieval method and device based on multi-mode matrix vector representation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105302793A (en) * 2015-10-21 2016-02-03 南方电网科学研究院有限责任公司 Method for automatically evaluating novelty of scientific and technical literature by using computer
CN107562824A (en) * 2017-08-21 2018-01-09 昆明理工大学 Text similarity detection method
CN107704469A (en) * 2016-08-08 2018-02-16 中国科学院文献情报中心 Mapping method and device for patent data and industry data
US20190065506A1 (en) * 2017-08-28 2019-02-28 Beijing Baidu Netcom Science And Technology Co., Ltd. Search method and apparatus based on artificial intelligence
CN109858028A (en) * 2019-01-30 2019-06-07 神思电子技术股份有限公司 Short text similarity calculation method based on a probabilistic model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281583B (en) * 2013-07-02 2018-01-12 索意互动(北京)信息技术有限公司 Information retrieval method and device
CN106095737A (en) * 2016-06-07 2016-11-09 杭州凡闻科技有限公司 Document similarity calculation method and whole-network retrieval and tracking method for similar documents
CN107220307B (en) * 2017-05-10 2020-09-25 清华大学 Webpage searching method and device
CN111813930B (en) * 2020-06-15 2024-02-20 语联网(武汉)信息技术有限公司 Similar document retrieval method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021253873A1 (en) * 2020-06-15 2021-12-23 语联网(武汉)信息技术有限公司 Method and apparatus for retrieving similar document
CN112632907A (en) * 2021-01-04 2021-04-09 北京明略软件***有限公司 Document marking method, device and equipment
CN113094519A (en) * 2021-05-07 2021-07-09 超凡知识产权服务股份有限公司 Method and device for searching based on document
CN113094519B (en) * 2021-05-07 2023-04-14 超凡知识产权服务股份有限公司 Method and device for searching based on document

Also Published As

Publication number Publication date
CN111813930B (en) 2024-02-20
WO2021253873A1 (en) 2021-12-23

Similar Documents

Publication Publication Date Title
CN111444320B (en) Text retrieval method and device, computer equipment and storage medium
CN111813930B (en) Similar document retrieval method and device
US20240028651A1 (en) System and method for processing documents
US11100124B2 (en) Systems and methods for similarity and context measures for trademark and service mark analysis and repository searches
CN105045781B (en) Query term similarity calculation method and device and query term search method and device
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US20160260033A1 (en) Systems and Methods for Similarity and Context Measures for Trademark and Service Mark Analysis and Repository Searchess
US12032915B2 (en) Creating and interacting with data records having semantic vectors and natural language expressions produced by a machine-trained model
US10360219B2 (en) Applying level of permanence to statements to influence confidence ranking
US20190340503A1 (en) Search system for providing free-text problem-solution searching
JP2022073981A (en) Source code retrieval
US11379527B2 (en) Sibling search queries
CN117688163B (en) Online intelligent question-answering method and device based on instruction fine tuning and retrieval enhancement generation
JP2020091857A (en) Classification of electronic document
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN113901783B (en) Domain-oriented document duplication checking method and system
CN116932730B (en) Document question-answering method and related equipment based on multi-way tree and large-scale language model
CN110348539A (en) Short text correlation method of discrimination
CN114138969A (en) Text processing method and device
CN111259180B (en) Image pushing method, device, electronic equipment and storage medium
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN111062219A (en) Latent semantic analysis text processing method and device based on tensor
CN109684357A (en) Information processing method and device, storage medium, terminal
CN115640375A (en) Technical problem extraction method in patent literature and related equipment
US20240095268A1 (en) Productivity improvements in document comprehension

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant