CN111460090A - Vector-based document retrieval method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN111460090A
Authority
CN
China
Prior art keywords
retrieval
vector
document
vocabulary
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010143243.6A
Other languages
Chinese (zh)
Inventor
王盼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010143243.6A
Publication of CN111460090A
Priority to PCT/CN2021/070585 (WO2021175005A1)
Current legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of information retrieval and provides a vector-based document retrieval method and apparatus, a computer device, and a storage medium. The method comprises the following steps: acquiring retrieval information input at a client; extracting each word in the retrieval information and converting the retrieval information into a retrieval vector according to the semantics of each word; calculating the similarity between the retrieval vector and each pre-formed document vector matched with a resource document; and sorting the resource documents matched with the document vectors according to the similarity, and taking the sorted resource documents as the retrieval result. The method and apparatus address the low retrieval accuracy and high retrieval difficulty of prior-art document retrieval methods.

Description

Vector-based document retrieval method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of information retrieval technologies, and in particular, to a vector-based document retrieval method, an apparatus, a computer device, and a storage medium.
Background
With the continuous development of information technology, the amount of information generated by various industries keeps growing, and conventional retrieval methods increasingly fail to meet everyday retrieval needs. Conventional document retrieval methods therefore need to be improved so that users can obtain retrieval results quickly and accurately. At present, existing document retrieval methods generally build a Boolean expression from a keyword and a logical operator, and then retrieve documents according to the keyword and the logical relationship that the Boolean expression represents.
Although such methods do return retrieval results, a Boolean expression cannot easily reflect the user's needs fully and accurately, so it demands considerable retrieval skill from the user. Moreover, retrieval relies only on the literal form of the keyword entered by the user rather than on its meaning. As a result, retrieval accuracy is low and retrieval difficulty is high.
In summary, prior-art document retrieval methods suffer from low retrieval accuracy and high retrieval difficulty.
Disclosure of Invention
The invention provides a vector-based document retrieval method, a vector-based document retrieval apparatus, a computer device, and a storage medium, which are used for solving the problems of low retrieval accuracy and high retrieval difficulty of existing document retrieval methods.
A first embodiment of the present invention provides a vector-based document retrieval method, including:
acquiring retrieval information input at a client;
extracting each vocabulary in the retrieval information, and converting the retrieval information into retrieval vectors according to the semantics of each vocabulary in the retrieval information;
calculating the similarity between the retrieval vector and a preformed document vector matched with the resource document;
and sorting the resource documents matched with the document vectors according to the similarity, and taking the sorted resource documents as the retrieval result.
A second embodiment of the present invention provides a vector-based document retrieval apparatus including:
the retrieval information acquisition module is used for acquiring retrieval information input at the client;
the retrieval vector acquisition module is used for extracting each vocabulary in the retrieval information and converting the retrieval information into a retrieval vector according to the semantics of each vocabulary in the retrieval information;
the similarity obtaining module is used for calculating the similarity between the retrieval vector and a preformed document vector matched with the resource document;
and the retrieval result acquisition module is used for sorting the resource documents matched with the document vectors according to the similarity and taking the sorted resource documents as the retrieval result.
A third embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the vector-based document retrieval method provided by the first embodiment of the present invention when executing the computer program.
A fourth embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the vector-based document retrieval method provided by the first embodiment of the present invention.
In the vector-based document retrieval method and apparatus, computer device, and storage medium, retrieval information input at a client is first acquired; each word in the retrieval information is then extracted, and the retrieval information is converted into a retrieval vector according to the semantics of each word; the similarity between the retrieval vector and each pre-formed document vector matched with a resource document is then calculated; finally, the resource documents matched with the document vectors are sorted according to the similarity, and the sorted resource documents are taken as the retrieval result. Because the retrieval information is converted into a retrieval vector according to the semantics of its words, the method and apparatus can address the low retrieval accuracy and high retrieval difficulty of prior-art document retrieval methods.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a diagram of an application environment of a vector-based document retrieval method according to a first embodiment of the present invention;
FIG. 2 is a flowchart of a vector-based document retrieval method according to a first embodiment of the present invention;
FIG. 3 is a flowchart of step 12 of the vector-based document retrieval method of the first embodiment of the present invention;
FIG. 4 is a flowchart of step 122 of the vector-based document retrieval method of the first embodiment of the present invention;
FIG. 5 is a flowchart of step 1222 of the vector-based document retrieval method of the first embodiment of the present invention;
FIG. 6 is yet another flowchart of a vector-based document retrieval method according to the first embodiment of the present invention;
FIG. 7 is a block diagram of a vector-based document retrieval apparatus according to a second embodiment of the present invention;
FIG. 8 is a schematic block diagram of the retrieval vector acquisition module of the vector-based document retrieval apparatus according to the second embodiment of the present invention;
FIG. 9 is a schematic block diagram of the first keyword acquisition unit of the vector-based document retrieval apparatus according to the second embodiment of the present invention;
FIG. 10 is a block diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The vector-based document retrieval method provided by the first embodiment of the present invention can be applied to the application environment shown in FIG. 1, in which a client (computer device) communicates with a server through a network. The server acquires the retrieval information input at the client, extracts each word in the retrieval information, converts the retrieval information into a retrieval vector according to the semantics of each word, calculates the similarity between the retrieval vector and each pre-formed document vector matched with a resource document, sorts the resource documents matched with the document vectors according to the similarity, takes the sorted resource documents as the retrieval result, and sends the retrieval result to the client. The client (computer device) may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server may be implemented as an independent server or as a server cluster composed of multiple servers.
In a first embodiment of the present invention, as shown in FIG. 2, a vector-based document retrieval method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps 11 to 14.
Step 11: retrieval information input at a client is acquired.
The retrieval information includes information related to a specified target document, which is input by a user in order to obtain the target document.
Step 12: extract each word in the retrieval information, and convert the retrieval information into a retrieval vector according to the semantics of each word in the retrieval information.
Each piece of retrieval information corresponds to one retrieval vector.
Further, as an implementation manner of this embodiment, as shown in fig. 3, the step 12 specifically includes the following steps 121 to 124:
Step 121: acquire each word contained in the retrieval information.
Step 121 comprises performing word segmentation on the retrieval information and then removing the stop words. Specifically, each character in the retrieval information is first isolated, and a dictionary is queried to determine whether adjacent characters can form a phrase. If they can, the adjacent characters are combined into that phrase; if they cannot, the characters are kept separate. Characters or symbols that form no phrase are then treated as stop words, and each phrase formed is taken as a word of the retrieval information. The adjacent characters may be two consecutive characters or three consecutive characters; this is not limited here.
To make step 121 clearer, consider an example in which the retrieval information is "explore the mysteries of the universe". Each character of the retrieval information is first isolated, and the dictionary is then queried to determine which adjacent characters form phrases, giving "explore / universe / of / mysteries". The phrases "explore", "universe", and "mysteries" are taken as the words of the retrieval information, and the particle "of" is treated as a stop word because it forms no phrase with any other character in the retrieval information.
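The segmentation in step 121 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a greedy longest-match lookup (the text does not specify a matching strategy), and the names `segment`, `DICTIONARY`, and `MAX_PHRASE_LEN` as well as the toy dictionary are hypothetical. Latin letters stand in for individual characters.

```python
# Hypothetical toy dictionary; each letter stands in for one character.
DICTIONARY = {"ab", "cd", "efg"}
MAX_PHRASE_LEN = 3  # phrases of two or three adjacent characters


def segment(text, dictionary=DICTIONARY, max_len=MAX_PHRASE_LEN):
    """Greedy longest-match segmentation (an assumed strategy).

    Returns (vocabulary, stop_words): the phrases found in the dictionary,
    and the leftover single characters that formed no phrase.
    """
    vocabulary, stop_words = [], []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase first, down to two characters.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                vocabulary.append(candidate)
                i += length
                break
        else:
            # No phrase starts at this character: treat it as a stop word.
            stop_words.append(text[i])
            i += 1
    return vocabulary, stop_words


words, stops = segment("abcdxefg")
print(words, stops)
```

Here the character `x` forms no phrase with its neighbours, so it ends up in the stop-word list while the three dictionary phrases become the words of the retrieval information.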
Step 122: represent words with similar semantics among the words by the same first keyword.
Specifically, words with similar semantics are represented by the same first keyword. When there are multiple words, the synonyms of each word should be looked up so that words with similar meanings can be represented by the same first keyword.
Step 123: count the number of occurrences of each first keyword.
Specifically, the occurrences of the first keywords can be counted through the TF-IDF (term frequency-inverse document frequency) algorithm, and each first keyword is labeled with its number of occurrences.
To make steps 122 and 123 clearer, consider an example. A first keyword is obtained for each word according to the semantics of the word, and the number of occurrences of each first keyword is counted. Suppose three words that are synonyms of one another, rendered here as "glad", "joyful", and "happy", occur 3, 4, and 5 times respectively. The synonym "happy" is taken as the first keyword and represents all three words. The number of occurrences of the first keyword "happy" is the sum of the occurrences of the three words, so it is labeled as 12.
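The counting in steps 122 and 123 can be sketched as below. This is an illustrative sketch only: the synonym table `FIRST_KEYWORD` and the English words in it are hypothetical stand-ins, and the code counts raw occurrences rather than applying any TF-IDF weighting.

```python
from collections import Counter

# Hypothetical synonym table: each word maps to its first keyword.
FIRST_KEYWORD = {"glad": "happy", "joyful": "happy", "happy": "happy"}


def count_first_keywords(words, keyword_of=FIRST_KEYWORD):
    """Replace each word by its first keyword and count the occurrences.

    Words without an entry in the table are kept as their own keyword
    (an assumption made for this sketch).
    """
    return Counter(keyword_of.get(word, word) for word in words)


words = ["glad"] * 3 + ["joyful"] * 4 + ["happy"] * 5
counts = count_first_keywords(words)
print(counts["happy"])  # 3 + 4 + 5 occurrences merged under one keyword
```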
Step 124: map each first keyword and its number of occurrences to a vector dictionary to obtain the retrieval vector.
The retrieval vector is an M-dimensional vector, where M is the number of first keywords in the vector dictionary. The words in the vector dictionary are stored in a tree-like hierarchical structure in which each word is a node; the fewer nodes that separate two words, the higher the similarity between them. The words in the vector dictionary may also be represented in the form of a three-dimensional matrix; specifically, the vector dictionary is composed of each word and its number of occurrences.
Note that in this embodiment each first keyword appears only once in the vector dictionary, that is, all first keywords in the vector dictionary are distinct. For example, when the first keyword A occurs 5 times, keyword A is assigned the value 5 in the retrieval vector formed from the vector dictionary; when the keyword B occurs 0 times, keyword B is assigned the value 0 in the retrieval vector.
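The mapping in step 124 can be illustrated with a short sketch. It assumes the vector dictionary can be flattened into an ordered list of distinct keywords (the tree-like storage is not modelled here); `to_vector` and `VECTOR_DICTIONARY` are hypothetical names.

```python
# Hypothetical vector dictionary: an ordered list of M distinct keywords.
VECTOR_DICTIONARY = ["A", "B", "C"]


def to_vector(keyword_counts, vector_dictionary=VECTOR_DICTIONARY):
    """Return an M-dimensional vector with one slot per dictionary keyword.

    A keyword that does not occur is assigned the value 0.
    """
    return [keyword_counts.get(keyword, 0) for keyword in vector_dictionary]


# Keyword A occurs 5 times, keyword B does not occur, keyword C occurs twice.
retrieval_vector = to_vector({"A": 5, "C": 2})
print(retrieval_vector)
```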
Step 13: calculate the similarity between the retrieval vector and each pre-formed document vector matched with a resource document.
The document vector contains information in the resource document and is used for representing the resource document, and one document vector represents one resource document.
Specifically, the similarity between the retrieval vector and a document vector which is formed in advance and matched with the resource document is obtained through the following formula (1) calculation:
cos θ = (a · b) / (|a| × |b|)    (1)
where cos θ represents the similarity between the retrieval vector and the document vector, θ represents the angle between the retrieval vector and the document vector, a represents the retrieval vector, b represents the document vector, · represents the dot product of the vectors, |a| represents the modulus (norm) of the retrieval vector, and |b| represents the modulus (norm) of the document vector.
It should be noted that when there are a plurality of resource documents, there should be a plurality of document vectors, and the similarity between each document vector and the search vector can be calculated according to the above formula.
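Formula (1) can be written out as a small function. This is a minimal sketch using only the standard library; returning 0.0 when either vector has zero norm is an assumption made here, since the text does not cover that case, and `cosine_similarity` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, as in formula (1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: a zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


retrieval_vector = [5, 0, 2]
document_vectors = {"doc1": [5, 0, 2], "doc2": [0, 3, 0]}
similarities = {
    name: cosine_similarity(retrieval_vector, vector)
    for name, vector in document_vectors.items()
}
print(similarities)
```

With several resource documents, the same function is simply applied once per document vector, as in the dictionary comprehension above.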
Step 14: sort the resource documents matched with the document vectors according to the similarity, and take the sorted resource documents as the retrieval result.
The larger the similarity value, the more similar the document vector is to the retrieval vector. Specifically, the resource documents matched with the document vectors are sorted in descending order of similarity and then sent to the client.
In addition, in this embodiment, only the resource documents whose similarity reaches a preset threshold may be sent to the client.
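The sorting in step 14, together with the optional threshold, can be sketched as follows. The function name `rank_documents`, the document identifiers, and the similarity values are hypothetical and chosen only for illustration.

```python
def rank_documents(similarities, threshold=None):
    """Sort documents in descending order of similarity.

    similarities: dict mapping a document identifier to its similarity.
    threshold: if given, keep only documents whose similarity reaches it.
    """
    items = list(similarities.items())
    if threshold is not None:
        items = [(doc, score) for doc, score in items if score >= threshold]
    return sorted(items, key=lambda pair: pair[1], reverse=True)


scores = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.5}  # illustrative values
print(rank_documents(scores))
print(rank_documents(scores, threshold=0.5))
```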
Through steps 11 to 14, the retrieval information is converted into a retrieval vector according to its semantics, the similarity between the retrieval vector and each document vector representing a resource document is calculated, and the resource documents are sorted by similarity. No retrieval expression is needed, and the resource documents are retrieved according to the semantics of the words in the retrieval information. This avoids the situation in which the retrieved resource documents merely match the literal characters of the words in the retrieval information while the semantics of the words in the resource documents and the retrieval information are ignored, thereby reducing retrieval difficulty and improving retrieval precision.
Further, as an implementation manner of this embodiment, as shown in fig. 4, the step 122 specifically includes the following steps 1221 to 1223:
Step 1221: acquire at least one synonym matched with each word from the synonym forest.
A synonym may be any word associated with the word in the synonym forest.
Step 1222: calculate the semantic similarity between each word and each of its matched synonyms.
Specifically, the semantic similarity is calculated separately for each word and each of its corresponding synonyms.
Step 1223: when the semantic similarity between a word and a synonym reaches a preset first threshold, take the synonym as the first keyword matched with that word.
Through steps 1221 to 1223, the first keyword can be obtained from the synonyms according to semantics.
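Steps 1221 to 1223 can be sketched as a threshold-based selection. Several details here are assumptions rather than part of the text: the similarity function is passed in as a parameter, the best-scoring synonym is chosen when more than one reaches the threshold, and the word itself is kept when none does. The names, the threshold value 0.8, and the score table are hypothetical.

```python
def pick_first_keyword(word, synonyms, similarity, first_threshold=0.8):
    """Return the synonym to use as the first keyword for `word`.

    A synonym qualifies when its semantic similarity to the word reaches
    `first_threshold`. Among qualifying synonyms the best-scoring one is
    chosen; if none qualifies, the word itself is returned (assumptions).
    """
    best, best_score = None, first_threshold
    for synonym in synonyms:
        score = similarity(word, synonym)
        if score >= best_score:
            best, best_score = synonym, score
    return best if best is not None else word


# Hypothetical precomputed semantic similarities.
SCORES = {("glad", "happy"): 0.9, ("glad", "content"): 0.6}


def lookup_similarity(word, synonym):
    return SCORES.get((word, synonym), 0.0)


print(pick_first_keyword("glad", ["happy", "content"], lookup_similarity))
```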
Further, as an implementation manner of this embodiment, as shown in fig. 5, the step 1222 specifically includes the following steps 12221 to 12223:
Step 12221: acquire first semantic information according to the word, and acquire second semantic information according to the synonym;
Step 12222: acquire first semantic keywords from the first semantic information to form a first data set, and acquire second semantic keywords from the second semantic information to form a second data set;
Step 12223: calculate the similarity between the first data set and the second data set, and take the calculated similarity as the semantic similarity.
For step 12221, the word and the synonym are each looked up in a dictionary or a search engine to obtain the corresponding feedback results. The feedback result for the word is used as the first semantic information, and the feedback result for the synonym is used as the second semantic information. The feedback results are generally explanations of the word and of the synonym.
For step 12222, the keywords in the first semantic information are extracted as the first semantic keywords, and the keywords in the second semantic information are extracted as the second semantic keywords.
For step 12223, the similarity between each word in the first data set and each word in the second data set is calculated. Either the maximum of these similarities or their average may be used as the semantic similarity.
Through steps 12221 to 12223, the semantic similarity between each word and its synonym can be judged according to the semantics of both, which helps retrieve resource documents according to semantics and improves retrieval precision.
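The data-set comparison in step 12223 can be sketched as below, covering both the maximum and the average variant. The text does not say how two individual words are compared, so the word-level measure is passed in as a parameter; the exact-match function `exact_match` and the sample keywords are hypothetical placeholders.

```python
def semantic_similarity(first_set, second_set, word_similarity, mode="max"):
    """Similarity between two data sets of semantic keywords.

    Every word of the first data set is compared with every word of the
    second; the result is either the maximum or the average of the scores.
    """
    scores = [word_similarity(u, v) for u in first_set for v in second_set]
    if not scores:
        return 0.0  # assumption: an empty data set yields similarity 0
    if mode == "max":
        return max(scores)
    if mode == "average":
        return sum(scores) / len(scores)
    raise ValueError("mode must be 'max' or 'average'")


def exact_match(u, v):
    """Placeholder word-level measure: 1.0 for identical words, else 0.0."""
    return 1.0 if u == v else 0.0


first = ["joy", "pleasure"]   # keywords from the word's explanation
second = ["joy", "delight"]   # keywords from the synonym's explanation
print(semantic_similarity(first, second, exact_match, mode="max"))
print(semantic_similarity(first, second, exact_match, mode="average"))
```

The maximum variant reports a match as soon as any pair of keywords agrees, while the average variant rewards explanations that overlap broadly, so the choice affects how strict the first threshold behaves.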
Further, as an implementation of this embodiment, the resource documents need to be converted into document vectors before retrieval. As shown in FIG. 6, obtaining the pre-formed document vector matched with a resource document specifically includes the following steps 21 to 24:
Step 21: acquire each resource word from the resource document.
The resource document is a carrier of descriptive information. It may be obtained by crawling website pages with crawler technology or by extracting text with character recognition technology; this is not specifically limited here.
Step 22: represent resource words with similar semantics among the resource words by the same second keyword.
Specifically, all resource fields in the resource document are first extracted. The resource fields are the words, characters, and the like present in the resource document, and the extracted resource fields keep the same order as in the original resource document.
In addition, the method of representing resource words with similar semantics by a second keyword in step 22 is the same as the method of representing words with similar semantics by the same first keyword in step 122, so it is not described again here.
Step 23: count the number of occurrences of each second keyword.
Specifically, the occurrences of the second keywords are counted through the TF-IDF algorithm, and each second keyword is labeled with its number of occurrences.
Step 24: map each second keyword and its number of occurrences to the vector dictionary to obtain the document vector.
The method of mapping each second keyword and its number of occurrences to the vector dictionary in step 24 is the same as the method of mapping each first keyword and its number of occurrences to the vector dictionary in step 124, so it is not described again here.
Through steps 21 to 24, each resource document can be converted into a document vector, so that the similarity between the document vector and the retrieval vector can be calculated in steps 11 to 14 and the resource document corresponding to the document vector can be obtained.
Note that this embodiment does not limit the content of the retrieval information. The retrieval information may itself be a document; in that case, steps 11 to 14 yield the resource documents that match the document, including the resource document most similar to it.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
A second embodiment of the present invention provides a vector-based document retrieval apparatus that corresponds one-to-one to the vector-based document retrieval method provided in the first embodiment described above.
Further, as shown in fig. 7, the vector-based document retrieval apparatus includes a retrieval information acquisition module 41, a retrieval vector acquisition module 42, a similarity acquisition module 43, and a retrieval result acquisition module 44.
The functional modules are explained in detail as follows:
a retrieval information acquisition module 41 for acquiring retrieval information input at the client;
a retrieval vector obtaining module 42, configured to extract each vocabulary in the retrieval information, and convert the retrieval information into a retrieval vector according to semantics of each vocabulary in the retrieval information;
a similarity obtaining module 43, configured to calculate a similarity between the retrieval vector and a pre-formed document vector that matches the resource document;
and the retrieval result acquisition module 44 is configured to rank the resource documents matched with the document vectors according to the similarity, and use the ranking result of the resource documents as the retrieval result.
Further, as an implementation of this embodiment, as shown in FIG. 8, the retrieval vector acquisition module 42 includes a word segmentation processing unit 421, a first keyword acquisition unit 422, a statistics unit 423, and a retrieval vector acquisition unit 424. The detailed functions of the functional units are as follows:
a word segmentation processing unit 421, configured to obtain each word included in the search information;
a first keyword obtaining unit 422, configured to represent words with similar semantics in each word with the same first keyword;
a counting unit 423 for counting the number of times of occurrence of each first keyword;
a retrieval vector obtaining unit 424, configured to map each first keyword and the occurrence number of each first keyword to a vector dictionary to obtain a retrieval vector.
Further, as an embodiment of the present embodiment, as shown in fig. 9, the first keyword acquisition unit 422 includes a synonym acquisition subunit 4221, a semantic similarity acquisition subunit 4222, and a first keyword acquisition subunit 4223. The detailed functions of each functional subunit are as follows:
a synonym obtaining subunit 4221, configured to obtain at least one synonym that matches each vocabulary from a synonym forest;
a semantic similarity obtaining subunit 4222, configured to calculate semantic similarities between each vocabulary and the corresponding matched synonym;
the first keyword obtaining subunit 4223 is configured to, when the similarity between the vocabulary and the synonym reaches a preset first threshold, take the synonym as a first keyword matched with the corresponding vocabulary.
Further, as an implementation of this embodiment, the semantic similarity obtaining subunit 4222 includes a semantic information acquisition subunit, a data set acquisition subunit, and a semantic similarity calculation subunit. The detailed functions of each functional subunit are as follows:
the semantic information acquisition subunit is used for acquiring first semantic information according to the vocabulary and acquiring second semantic information according to the synonym;
the data set acquisition subunit is used for acquiring a first semantic keyword from the first semantic information to form a first data set and acquiring a second semantic keyword from the second semantic information to form a second data set;
and the semantic similarity calculation subunit is used for calculating the similarity between the first data set and the second data set and taking the calculated similarity as the semantic similarity.
Further, as an implementation manner of this embodiment, the vector-based document retrieval apparatus further includes a resource vocabulary obtaining module, a second keyword obtaining module, a word frequency statistics module, and a document vector obtaining module. The detailed functions of the functional modules are as follows:
the resource vocabulary acquisition module is used for acquiring each resource vocabulary from the resource document;
the second keyword acquisition module is used for respectively representing the resource vocabularies with similar semantics in each resource vocabulary by using second keywords;
the word frequency counting module is used for counting the occurrence frequency of each second keyword;
and the document vector acquisition module is used for mapping each second keyword and the occurrence frequency of each second keyword to a vector dictionary to obtain a document vector.
For specific limitations of the vector-based document retrieval apparatus, reference may be made to the above limitations of the vector-based document retrieval method, which are not repeated here. Each module in the vector-based document retrieval apparatus described above may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to each module.
A third embodiment of the present invention provides a computer device, which may be a server, and whose internal structure may be as shown in fig. 10. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing data involved in the vector-based document retrieval method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement the vector-based document retrieval method provided by the first embodiment of the present invention.
A fourth embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the steps of the vector-based document retrieval method provided by the first embodiment of the present invention, such as steps 11 to 14 shown in fig. 2, steps 121 to 124 shown in fig. 3, steps 1221 to 1223 shown in fig. 4, steps 12221 to 12223 shown in fig. 5, and steps 21 to 24 shown in fig. 6. Alternatively, the computer program, when executed by a processor, implements the functions of the modules/units of the vector-based document retrieval apparatus described above. To avoid repetition, details are not described here again.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is given as an example. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A vector-based document retrieval method, characterized by comprising the following steps:
acquiring retrieval information input at a client;
extracting each vocabulary in the retrieval information, and converting the retrieval information into retrieval vectors according to the semantics of each vocabulary in the retrieval information;
calculating the similarity between the retrieval vector and a preformed document vector matched with the resource document;
and sorting the resource documents matched with the document vectors according to the similarity, and taking the sorted result of the resource documents as a retrieval result.
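By way of non-limiting illustration, the last two steps of claim 1 can be sketched as below. The claim does not name a similarity measure; cosine similarity is assumed here, and the names `cosine_similarity`, `rank_documents`, and the document identifiers are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector carries no keywords; treat it as matching nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(retrieval_vector, document_vectors):
    """Return (document id, similarity) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(retrieval_vector, vector))
        for doc_id, vector in document_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pre-formed document vectors keyed by document identifier.
document_vectors = {"a": [2, 2, 0], "b": [0, 0, 3], "c": [1, 0, 0]}
retrieval_result = rank_documents([1, 1, 0], document_vectors)
```

Since the document vectors are formed in advance, only the retrieval vector and the similarity scores need to be computed when a query arrives.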
2. The vector-based document retrieval method of claim 1, wherein the extracting each vocabulary in the retrieval information and converting the retrieval information into a retrieval vector according to the semantics of each vocabulary in the retrieval information comprises:
acquiring each vocabulary contained in the retrieval information;
respectively representing the vocabularies with similar semantics in the vocabularies by using the same first keyword;
counting the occurrence frequency of each first keyword;
and mapping each first keyword and the occurrence frequency of each first keyword to a vector dictionary to obtain the retrieval vector.
3. The method of claim 2, wherein said representing the words with similar semantics by the same first keyword comprises:
acquiring at least one synonym matched with each vocabulary from a synonym forest;
calculating semantic similarity between each vocabulary and the corresponding matched synonym;
and when the similarity between the vocabulary and the synonym reaches a preset first threshold value, taking the synonym as the first keyword matched with the corresponding vocabulary.
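The selection described in claim 3 can be sketched as follows, purely as an example. The threshold value `0.8`, the tie-breaking rule (the highest-scoring synonym wins), and the fallback of keeping the vocabulary itself when no synonym reaches the threshold are assumptions not stated in the claim; `similarity` stands for any semantic similarity function supplied by the caller.

```python
def choose_first_keyword(word, synonyms, similarity, first_threshold=0.8):
    """Pick the synonym whose similarity to `word` reaches the threshold.

    Assumptions for illustration: among several qualifying synonyms the
    highest-scoring one is chosen, and the word stands for itself when
    no synonym qualifies.
    """
    best, best_score = word, first_threshold
    for synonym in synonyms:
        score = similarity(word, synonym)
        if score >= best_score:
            best, best_score = synonym, score
    return best


# Hypothetical similarity scores standing in for a real semantic measure.
scores = {("automobile", "car"): 0.9, ("automobile", "truck"): 0.6}
first_keyword = choose_first_keyword(
    "automobile", ["car", "truck"], lambda w, s: scores[(w, s)]
)
```

Raising the threshold makes the mapping more conservative: fewer vocabularies are merged under one keyword, so the resulting vectors stay closer to the literal wording.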
4. The vector-based document retrieval method of claim 3, wherein the step of calculating semantic similarity of each vocabulary with the corresponding matched synonym comprises:
acquiring first semantic information according to the vocabulary, and acquiring second semantic information according to the synonym;
acquiring a first semantic keyword from the first semantic information to form a first data set, and acquiring a second semantic keyword from the second semantic information to form a second data set;
and calculating the similarity between the first data set and the second data set, and taking the calculated similarity as the semantic similarity.
5. The vector-based document retrieval method of claim 1, wherein obtaining the pre-formed document vector that matches the resource document comprises:
acquiring each resource vocabulary from the resource document;
respectively representing the resource vocabularies with similar semantics in each resource vocabulary by using second keywords;
counting the occurrence frequency of each second keyword;
and mapping each second keyword and the occurrence frequency of each second keyword to a vector dictionary to obtain the document vector.
6. A vector-based document retrieval apparatus, comprising:
the retrieval information acquisition module is used for acquiring retrieval information input at the client;
the retrieval vector acquisition module is used for extracting each vocabulary in the retrieval information and converting the retrieval information into a retrieval vector according to the semantics of each vocabulary in the retrieval information;
the similarity obtaining module is used for calculating the similarity between the retrieval vector and a preformed document vector matched with the resource document;
and the retrieval result acquisition module is used for sorting the resource documents matched with the document vectors according to the similarity and taking the sorted result of the resource documents as a retrieval result.
7. The vector-based document retrieval device of claim 6, wherein the retrieval vector acquisition module comprises:
the word segmentation processing unit is used for acquiring each word contained in the retrieval information;
the first keyword acquisition unit is used for respectively representing the vocabularies with similar semantics in all the vocabularies by using the same first keyword;
the counting unit is used for counting the occurrence frequency of each first keyword;
and the retrieval vector acquisition unit is used for mapping each first keyword and the occurrence frequency of each first keyword to a vector dictionary to obtain the retrieval vector.
8. The vector-based document retrieval device according to claim 7, wherein the first keyword acquisition unit includes:
the synonym obtaining subunit is used for obtaining at least one synonym matched with each vocabulary from the synonym forest;
a semantic similarity obtaining subunit, configured to calculate semantic similarities between the vocabularies and the corresponding matched synonyms;
and the first keyword acquisition subunit is used for taking the synonym as the first keyword matched with the corresponding vocabulary when the similarity between the vocabulary and the synonym reaches a preset first threshold value.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the vector based document retrieval method according to any of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the vector-based document retrieval method according to any one of claims 1 to 5.
CN202010143243.6A 2020-03-04 2020-03-04 Vector-based document retrieval method and device, computer equipment and storage medium Pending CN111460090A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010143243.6A CN111460090A (en) 2020-03-04 2020-03-04 Vector-based document retrieval method and device, computer equipment and storage medium
PCT/CN2021/070585 WO2021175005A1 (en) 2020-03-04 2021-01-07 Vector-based document retrieval method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010143243.6A CN111460090A (en) 2020-03-04 2020-03-04 Vector-based document retrieval method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111460090A true CN111460090A (en) 2020-07-28

Family

ID=71680091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010143243.6A Pending CN111460090A (en) 2020-03-04 2020-03-04 Vector-based document retrieval method and device, computer equipment and storage medium

Country Status (2)

Country Link
CN (1) CN111460090A (en)
WO (1) WO2021175005A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112506864A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 File retrieval method and device, electronic equipment and readable storage medium
WO2021175005A1 (en) * 2020-03-04 2021-09-10 深圳壹账通智能科技有限公司 Vector-based document retrieval method and apparatus, computer device, and storage medium
CN113449063A (en) * 2021-06-25 2021-09-28 树根互联股份有限公司 Method and device for constructing document structure information retrieval library
CN113704408A (en) * 2021-08-31 2021-11-26 工银科技有限公司 Retrieval method, retrieval apparatus, electronic device, storage medium, and program product
CN114818678A (en) * 2022-03-28 2022-07-29 西安远诺技术转移有限公司 Scientific and technological achievement management method and platform and electronic equipment

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115878759B (en) * 2023-01-05 2023-05-26 京华信息科技股份有限公司 Text searching method, device and storage medium
CN116401212B (en) * 2023-06-07 2023-08-11 东营市第二人民医院 Personnel file quick searching system based on data analysis
CN116842138A (en) * 2023-07-24 2023-10-03 上海诚狐信息科技有限公司 Document-based retrieval method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019669B (en) * 2017-10-31 2021-06-29 北京国双科技有限公司 Text retrieval method and device
US10885121B2 (en) * 2017-12-13 2021-01-05 International Business Machines Corporation Fast filtering for similarity searches on indexed data
CN110276071B (en) * 2019-05-24 2023-10-13 众安在线财产保险股份有限公司 Text matching method and device, computer equipment and storage medium
CN110807149B (en) * 2019-10-11 2023-07-14 卓尔智联(武汉)研究院有限公司 Retrieval method, device and storage medium
CN111460090A (en) * 2020-03-04 2020-07-28 深圳壹账通智能科技有限公司 Vector-based document retrieval method and device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021175005A1 (en) * 2020-03-04 2021-09-10 深圳壹账通智能科技有限公司 Vector-based document retrieval method and apparatus, computer device, and storage medium
CN112506864A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 File retrieval method and device, electronic equipment and readable storage medium
CN112506864B (en) * 2020-12-18 2023-07-25 北京百度网讯科技有限公司 File retrieval method, device, electronic equipment and readable storage medium
CN113449063A (en) * 2021-06-25 2021-09-28 树根互联股份有限公司 Method and device for constructing document structure information retrieval library
CN113704408A (en) * 2021-08-31 2021-11-26 工银科技有限公司 Retrieval method, retrieval apparatus, electronic device, storage medium, and program product
CN114818678A (en) * 2022-03-28 2022-07-29 西安远诺技术转移有限公司 Scientific and technological achievement management method and platform and electronic equipment

Also Published As

Publication number Publication date
WO2021175005A1 (en) 2021-09-10

Similar Documents

Publication Publication Date Title
CN111460090A (en) Vector-based document retrieval method and device, computer equipment and storage medium
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN109815333B (en) Information acquisition method and device, computer equipment and storage medium
KR102491172B1 (en) Natural language question-answering system and learning method
US10482146B2 (en) Systems and methods for automatic customization of content filtering
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN112115232A (en) Data error correction method and device and server
CN111159363A (en) Knowledge base-based question answer determination method and device
CN110175221B (en) Junk short message identification method by combining word vector with machine learning
CN111445968A (en) Electronic medical record query method and device, computer equipment and storage medium
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
CN111221944A (en) Text intention recognition method, device, equipment and storage medium
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN114398882A (en) Document processing method, device, equipment and storage medium
WO2021051934A1 (en) Method and apparatus for extracting key contract term on basis of artificial intelligence, and storage medium
CN110309252B (en) Natural language processing method and device
CN112527985A (en) Unknown problem processing method, device, equipment and medium
CN112579781A (en) Text classification method and device, electronic equipment and medium
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination