CN116975202A - Document retrieval method, device, equipment and storage medium - Google Patents

Document retrieval method, device, equipment and storage medium

Info

Publication number
CN116975202A
Authority
CN
China
Prior art keywords
document
word
importance
preset
target word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310827625.4A
Other languages
Chinese (zh)
Inventor
彭怀瑾
王东
李洪菊
成龙
李志荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Government And Enterprises Customer Branch Office Of China Mobile Communications Co ltd
China Mobile Communications Group Co Ltd
Original Assignee
Government And Enterprises Customer Branch Office Of China Mobile Communications Co ltd
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Government And Enterprises Customer Branch Office Of China Mobile Communications Co ltd, China Mobile Communications Group Co Ltd
Priority to CN202310827625.4A
Publication of CN116975202A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data query and discloses a document retrieval method, device, equipment and storage medium. The method comprises the following steps: acquiring a target word string of a search term input by a user, and determining the semantic importance of the target word string; acquiring candidate documents matching the target word string, and determining the relevance between the target word string and the candidate documents; optimizing the relevance according to the semantic importance to obtain an optimized relevance; and sorting and displaying the candidate documents according to the optimized relevance. By determining the semantic importance of the target word string and optimizing the relevance between the target word string and the candidate documents accordingly, the invention prevents a candidate document that matches only a word of low importance from being displayed merely because its relevance is high, so that the search results better meet the user's needs and the user experience is improved.

Description

Document retrieval method, device, equipment and storage medium
Technical Field
The present invention relates to the field of data query technologies, and in particular, to a method, an apparatus, a device, and a storage medium for retrieving documents.
Background
With the development of Internet technology, the amount of information on the Internet has exploded, and more and more users search for the documents they need over a network. In the related art, document retrieval queries the candidate documents that match each word in the search term, calculates the relevance between each candidate document and the word it matches, and sorts and displays the candidate documents according to that relevance.
However, with this approach, a candidate document that matches only a word of low importance is still displayed if its relevance is high, so the search results fail to meet the user's needs and the user experience suffers.
Disclosure of Invention
The main object of the invention is to provide a document retrieval method, device, equipment and storage medium, aiming to solve the technical problem in the prior art that a candidate document matching only a word of low importance is still displayed if its relevance is high, so that the results fail to meet the user's expectations and the user experience suffers.
In order to achieve the above object, the present invention provides a document retrieval method applied to an online ranking model, the method comprising the following steps:
acquiring a target word string of a search term input by a user, and determining the semantic importance of the target word string;
acquiring a candidate document matching the target word string, and determining the relevance between the target word string and the candidate document;
optimizing the relevance according to the semantic importance to obtain an optimized relevance;
and sorting and displaying the candidate documents according to the optimized relevance.
Optionally, the step of determining the semantic importance of the target word string includes:
acquiring the original word importance of each word segment in the target word string relative to the search term;
determining the sentence length of the search term after word segmentation according to the number of word segments;
optimizing the original word importance based on the sentence length after word segmentation to obtain the semantic importance of the target word string.
Optionally, the step of optimizing the original word importance based on the sentence length after word segmentation to obtain the semantic importance of the target word string includes:
unifying the original word importance based on the sentence length after word segmentation through a preset word importance optimization formula to obtain the semantic importance of the target word string, wherein the preset word importance optimization formula is as follows:
$$W = \mathrm{Important}(Sent) \times \mathrm{len}(Sent),$$
wherein W is the unified word importance, Important(Sent) is the original word importance, and len(Sent) is the sentence length after word segmentation.
Optionally, the step of optimizing the relevance according to the semantic importance to obtain an optimized relevance includes:
optimizing the relevance based on the semantic importance through a preset relevance optimization formula to obtain the optimized word segment relevance of each word segment, wherein the preset relevance optimization formula is as follows:
$$S_{tw} = W_i \times S_i,$$
wherein $S_{tw}$ is the optimized word segment relevance, $W_i$ is the unified word importance of word $i$ in the target word string, and $S_i$ is the relevance between word $i$ in the target word string and the candidate document;
superposing the optimized word segment relevance of each word segment through a preset relevance superposition formula to obtain the optimized relevance of the target word string, wherein the preset relevance superposition formula is as follows:
$$R_{doc} = \sum_{tw} S_{tw},$$
wherein $R_{doc}$ is the optimized relevance and $S_{tw}$ is the optimized word segment relevance.
Optionally, the step of acquiring the candidate document matching the target word string includes:
querying, from a preset inverted database, the matching word strings that match each word in the target word string;
concurrently querying, from a preset forward database and based on the matching word strings, the matching documents that match the matching word strings;
and recalling the matching documents and taking them as the candidate documents matching the target word string.
Optionally, before the step of acquiring the candidate document matching the target word string, the method further includes:
acquiring a sample document, and selecting a parsing strategy according to the format of the sample document to parse it into a parsed document with a uniform format;
segmenting the title of the parsed document at multiple granularities to obtain an inverted index;
and constructing the preset inverted database based on the inverted index, and constructing the preset forward database from the target sample documents corresponding to the inverted index.
Optionally, the construction of the online ranking model includes:
acquiring initial data formed from preset documents, and labeling the initial data according to preset industry keywords in the preset documents to obtain labeled data;
training a Transformer model with the labeled data to obtain an offline ranking model, wherein the output of the offline ranking model is a single-layer result obtained by merging the two layers of output of the Transformer model;
predicting on preset unlabeled document data with the offline ranking model to obtain a sample relevance result for the preset unlabeled document data;
and training an XGBoost model according to the sample relevance result to obtain the online ranking model.
In addition, in order to achieve the above object, the present invention also proposes a document retrieval apparatus comprising:
the semantic importance module is used for acquiring a target word string of the search term input by the user and determining the semantic importance of the target word string;
the document correlation module is used for acquiring candidate documents matched with the target word strings and determining the correlation between the target word strings and the candidate documents;
the correlation optimization module is used for optimizing the correlation according to the semantic importance and obtaining the optimized correlation;
and the candidate document display module is used for displaying the candidate documents after sequencing the candidate documents according to the optimized relevance.
In addition, in order to achieve the above object, the present invention also proposes a document retrieval device comprising: a memory, a processor, and a document retrieval program stored on the memory and executable on the processor, the document retrieval program being configured to implement the steps of the document retrieval method as described above.
In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a document retrieval program which, when executed by a processor, implements the steps of the document retrieval method as described above.
The invention provides a document retrieval method, device, equipment and storage medium. The method first acquires a target word string of a search term input by a user and determines the semantic importance of the target word string; then acquires candidate documents matching the target word string and determines the relevance between the target word string and the candidate documents; and finally optimizes the relevance according to the semantic importance to obtain an optimized relevance, and sorts and displays the candidate documents according to the optimized relevance. By determining the semantic importance of the target word string and optimizing the relevance between the target word string and the candidate documents accordingly, the invention prevents a candidate document that matches only a word of low importance from being displayed merely because its relevance is high, so that the search results better meet the user's needs and the user experience is improved.
Drawings
FIG. 1 is a schematic diagram of a document retrieval device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a first embodiment of the document retrieval method of the present invention;
FIG. 3 is a flowchart of a second embodiment of the document retrieval method of the present invention;
FIG. 4 is a flowchart of a third embodiment of a document retrieval method according to the present invention;
FIG. 5 is a schematic diagram showing the overall flow of document retrieval in a third embodiment of the document retrieval method of the present invention;
FIG. 6 is a schematic diagram of an output result of an offline ranking model in a third embodiment of a document retrieval method according to the present invention;
FIG. 7 is a schematic diagram of training an online ranking model in a third embodiment of a document retrieval method according to the present invention;
FIG. 8 is a block diagram showing the structure of a first embodiment of the document retrieving apparatus of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Referring to fig. 1, fig. 1 is a schematic diagram of a document retrieval device in a hardware running environment according to an embodiment of the present invention.
As shown in fig. 1, the document retrieval device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 enables communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is not limiting of the document retrieval device, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 1, an operating system, a network communication module, a user interface module, and a document retrieval program may be included in the memory 1005 as one type of storage medium.
In the document retrieval device shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 of the present invention may be provided in the document retrieval device, which invokes the document retrieval program stored in the memory 1005 through the processor 1001 and executes the document retrieval method provided by the embodiments of the present invention.
The embodiment of the invention provides a document retrieval method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the document retrieval method.
In this embodiment, the document retrieval method includes the steps of:
Step S10: acquiring a target word string of the search term input by the user, and determining the semantic importance of the target word string.
It should be noted that the execution body of the method of this embodiment may be a computing service device with document retrieval, network communication and program execution functions, for example a mobile phone, a tablet computer or a personal computer, or another electronic device that implements the same or similar functions. This embodiment and the following embodiments are described using the document retrieval device above.
It is understood that the search term is the term a user enters in order to retrieve a document. For example, if the search term entered by the user is "5G technology specifically used", the user wishes to obtain documents related to the use of 5G technology.
The target word string may be the word string formed by the word segments obtained after the search term is segmented. For example, if the search term input by the user is "5G development strategy path", the target word string may consist of segments such as "5G", "development", "strategy" and "development strategy".
It can be understood that the semantic importance is the weight of each word segment in the target word string relative to the search term. The higher the weight of a word segment, the higher its semantic importance; the lower the weight, the lower its semantic importance. During document retrieval, the retrieved documents should match the word segments of high semantic importance in order to better meet the user's expectations.
In a specific implementation, the document retrieval device may be applied to an online ranking model, or may be a physical terminal of the online ranking model. The document retrieval device receives the search term input by the user, segments it to obtain the target word string, and analyzes the weight of each word segment in the target word string. The weight may be assigned according to users' search habits, giving higher weight to word segments that are searched more frequently; it may give higher weight to nouns that occur more frequently in the industry; or it may give higher weight to nouns in general. This embodiment places no limitation on the weighting scheme. Once the weight of each word segment is determined, the semantic importance of the target word string can be determined.
It should be appreciated that the online ranking model may be trained from a large number of sample documents and the core keywords in each document, such as the core vocabulary in the title or paragraph information. The online ranking model requires a network connection to be used. When the user inputs the search term on the retrieval device, the search term can be regarded as being input to the online ranking model, which finally outputs the ranked candidate documents for display to the user.
While segmenting the search term, the device may also detect whether the search term contains misspelled or mistyped characters, for example "server" mistakenly entered as "accessor". If so, the search term is corrected and the corrected search term is segmented again, which improves retrieval precision and the user experience.
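The correction step above does not name a particular algorithm. As a minimal sketch, assuming a small hypothetical vocabulary of known terms and simple string-similarity matching from Python's standard `difflib` module (neither is specified in the source), mistyped segments could be replaced before re-segmentation as follows:

```python
import difflib

# Hypothetical vocabulary of known terms; a real system would load its own.
VOCABULARY = ["server", "network", "strategy", "document"]


def correct_tokens(tokens, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace each token with its closest vocabulary entry, if one is close enough."""
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        # get_close_matches returns at most n candidates whose similarity >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


fixed = correct_tokens(["servr", "netwrok", "5G"])
```

Tokens with no sufficiently similar vocabulary entry, such as "5G" here, are left unchanged, so the correction never discards user input.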
Step S20: acquiring a candidate document matching the target word string, and determining the relevance between the target word string and the candidate document.
It should be noted that the candidate documents may be documents stored in a preset database. The preset database may be a database built in advance that stores a large number of sample documents.
It will be appreciated that the relevance is a measure of how closely a candidate document is related to the target word string: the higher the relevance between the target word string and the candidate document, the more frequently the target word string occurs in the candidate document; conversely, the lower the relevance, the less frequently it occurs.
In a specific implementation, the document retrieval device may take each word in the target word string as a matching word and match it against the documents in the preset database. If a sample document contains the matching word or is related to it, that document is recalled as a candidate document. After all recalled candidate documents have been obtained, the relevance between each word and its corresponding candidate documents is calculated.
Step S30: optimizing the relevance according to the semantic importance to obtain the optimized relevance.
In a specific implementation, if the candidate documents were displayed to the user ranked by relevance alone, documents matched through words of low semantic importance could have high relevance and be ranked near the top, so that the search results would not meet the user's expectations. The document retrieval device can therefore combine the semantic importance of each word segment with its relevance to each candidate document. Multiplying the semantic importance by the relevance maps the two onto the same dimension and optimizes the relevance. The optimized relevances of the word segments are then superimposed to obtain the optimized relevance of the target word string, so that the optimized relevance reflects the semantic importance of the target word string and documents matched through words of low semantic importance are no longer ranked near the top merely because their relevance is high.
Step S40: sorting and displaying the candidate documents according to the optimized relevance.
In a specific implementation, the document retrieval device may sort the candidate documents by optimized relevance from high to low, so that candidate documents near the front have high relevance to the target word string and those near the back have low relevance. To avoid displaying too many documents, the number of displayed documents may be limited to a preset number of candidate documents (such as the first 100). Because the relevance between the displayed candidate documents and the target word string has been optimized by semantic importance, each candidate document is tied to the semantic importance of the target word string, and candidate documents with high relevance correspond to words of high semantic importance, so the search results match the user's expectations.
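The sort-and-truncate behaviour of step S40 can be sketched in a few lines of Python. The function name `rank_candidates`, the document identifiers and the scores are hypothetical and chosen only for illustration; the default limit of 100 follows the example in the text.

```python
from typing import Dict, List, Tuple


def rank_candidates(optimized_relevance: Dict[str, float],
                    top_n: int = 100) -> List[Tuple[str, float]]:
    """Sort candidate documents by optimized relevance, highest first, and keep top_n."""
    ranked = sorted(optimized_relevance.items(),
                    key=lambda item: item[1],
                    reverse=True)
    return ranked[:top_n]


# Hypothetical optimized relevance scores for three candidate documents.
scores = {"doc_a": 0.42, "doc_b": 0.91, "doc_c": 0.67}
top_two = rank_candidates(scores, top_n=2)
```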
In this embodiment, the target word string of the search term input by the user is acquired and its semantic importance is determined; candidate documents matching the target word string are then acquired and the relevance between the target word string and the candidate documents is determined; finally, the relevance is optimized according to the semantic importance to obtain the optimized relevance, and the candidate documents are sorted and displayed according to the optimized relevance. By determining the semantic importance of the target word string and optimizing the relevance accordingly, this embodiment prevents a candidate document that matches only a word of low importance from being displayed merely because its relevance is high, so that the search results better meet the user's needs and the user experience is improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the document searching method according to the present invention.
Based on the first embodiment, in this embodiment, the step of determining the semantic importance of the target word string includes:
Step S101: acquiring the original word importance of each word segment in the target word string relative to the search term.
It should be noted that the original word importance may be the weight of a single word segment in the target word string, calculated on the basis of the sentence length of the search term, where the sentence length of the search term may be the total number of words in the search term.
In a specific implementation, the document retrieval device may determine in turn the weight of each word segment in the target word string relative to the search term, and take the determined weight as the original word importance of that word segment. Since the original word importance is determined based on the number of words in the search term, the original word importances of all word segments sum to 1.
Step S102: determining the sentence length of the search term after word segmentation according to the number of word segments.
It should be noted that, because the original word importances of the word segments of any search term sum to 1, a word can have the same original word importance in two search terms of different lengths while its actual importance relative to each search term differs. Original word importances are therefore not comparable across search terms of different lengths. It is thus necessary to determine the importance of each word segment relative to the sentence length after word segmentation, so that the word importances of segments from search terms of different lengths can be compared.
In a specific implementation, the document retrieval device determines the sentence length of the search term after word segmentation, where the sentence length after word segmentation may be the total count of word segments.
Step S103: optimizing the original word importance based on the sentence length after word segmentation to obtain the semantic importance of the target word string.
In a specific implementation, the document retrieval device can combine the sentence length after word segmentation with the original word importance in order to optimize the original word importance. The optimized word importance is then expressed relative to the sentence length after word segmentation, so that the word importances of segments from search terms of different lengths can be compared and the optimized word importance reflects more accurately how important each word is within its search term.
Further, in this embodiment, the step S103 includes:
Step S1031: unifying the original word importance based on the sentence length after word segmentation through a preset word importance optimization formula to obtain the semantic importance of the target word string, wherein the preset word importance optimization formula is as follows:
$$W = \mathrm{Important}(Sent) \times \mathrm{len}(Sent),$$
wherein W is the unified word importance, Important(Sent) is the original word importance, and len(Sent) is the sentence length after word segmentation.
In a specific implementation, the document retrieval device may multiply the original word importance of each word by the sentence length after word segmentation through the preset word importance optimization formula to obtain the unified word importance. The unified word importance may serve as the semantic importance of the target word string; that is, the semantic importance is the word importance of each word relative to the sentence length after word segmentation.
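To illustrate why multiplying by the sentence length makes importances comparable, the following minimal Python sketch applies the formula to two hypothetical search terms of different lengths. The function name `unify_importance` and the importance values are assumptions made for illustration, and the sentence length is taken to be the number of word segments.

```python
def unify_importance(original_importance):
    """Apply W = Important(Sent) * len(Sent) to every word segment.

    original_importance maps each word segment to its original word
    importance; the values are expected to sum to 1.
    """
    sent_len = len(original_importance)  # sentence length after segmentation
    return {word: imp * sent_len for word, imp in original_importance.items()}


# Hypothetical original importances for a 2-segment and a 4-segment search term.
short_query = {"5G": 0.5, "path": 0.5}
long_query = {"5G": 0.5, "development": 0.2, "strategy": 0.2, "path": 0.1}

short_w = unify_importance(short_query)
long_w = unify_importance(long_query)
```

In both queries "5G" has an original importance of 0.5, but after unification it scores 1.0 in the short query (exactly average for a two-segment query) and 2.0 in the long query (twice the average for a four-segment query). A unified value of 1.0 always means "average importance", whatever the query length.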
Further, in this embodiment, the step S30 includes:
Step S301: optimizing the relevance based on the semantic importance through a preset relevance optimization formula to obtain the optimized word segment relevance of each word segment, wherein the preset relevance optimization formula is as follows:
$$S_{tw} = W_i \times S_i,$$
wherein $S_{tw}$ is the optimized word segment relevance, $W_i$ is the unified word importance of word $i$ in the target word string, and $S_i$ is the relevance between word $i$ in the target word string and the candidate document.
In a specific implementation, the document retrieval device may multiply the semantic importance by the relevance between each word segment and the candidate document through the preset relevance optimization formula, thereby folding the semantic importance into the relevance, so that the optimized word segment relevance reflects the semantic importance of each word segment.
Step S302: superposing the optimized word segment relevance of each word segment through a preset relevance superposition formula to obtain the optimized relevance of the target word string, wherein the preset relevance superposition formula is as follows:
$$R_{doc} = \sum_{tw} S_{tw},$$
wherein $R_{doc}$ is the optimized relevance and $S_{tw}$ is the optimized word segment relevance.
In a specific implementation, the document retrieval device can superimpose the optimized word segment relevance of each word segment through the preset relevance superposition formula to obtain the overall relevance of the target word string, namely the optimized relevance of the target word string. For example, suppose the search term is "5G development strategy path", the target word string is "5G", "development strategy" and "path", and the semantic importances of the segments are 0.3334, 0.4066 and 0.2600 respectively. Each semantic importance is multiplied by the relevance of the corresponding segment, and the products are superimposed to obtain the relevance between the target word string and the candidate document.
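The example above can be worked through numerically. The sketch below uses the three semantic importances from the text; the per-segment relevances of the two candidate documents are hypothetical, and the superposition is assumed to be a plain sum of the per-segment products, as the description of superposing suggests.

```python
def optimized_relevance(importance, relevance):
    """Sum W_i * S_i over all word segments of the target word string."""
    return sum(importance[word] * relevance.get(word, 0.0) for word in importance)


# Semantic importances from the example in the text.
importance = {"5G": 0.3334, "development strategy": 0.4066, "path": 0.2600}

# Hypothetical per-segment relevances for two candidate documents.
doc_x = {"5G": 0.2, "development strategy": 0.1, "path": 0.9}  # strong on "path"
doc_y = {"5G": 0.4, "development strategy": 0.6, "path": 0.0}  # strong on key terms

r_x = optimized_relevance(importance, doc_x)
r_y = optimized_relevance(importance, doc_y)
```

With unweighted relevances, doc_x (total 1.2) would outrank doc_y (total 1.0) purely because it matches the low-importance word "path". After weighting, doc_y scores about 0.377 against about 0.341 for doc_x, so the document matching the more important segments is ranked first.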
In this embodiment, the original word importance of each word segment in the target word string relative to the search term is acquired; the sentence length of the search term after word segmentation is determined according to the number of word segments; and the original word importance is optimized based on the sentence length after word segmentation to obtain the semantic importance of the target word string. The optimized relevance obtained by applying the semantic importance to the relevance therefore reflects the word importance of each word segment, which reduces the probability of recalling documents whose high relevance comes from hitting non-keywords and effectively improves the accuracy of the candidate documents obtained.
Referring to fig. 4, fig. 4 is a flowchart illustrating a third embodiment of a document retrieval method according to the present invention.
Based on the above embodiments, in this embodiment, the step of obtaining the candidate document matching the target word string includes:
Step S201: querying, from a preset inverted database, a matching word string that matches each word segment in the target word string.
It should be noted that, the preset inverted database may be a database storing inverted indexes, where the inverted indexes may be obtained by performing word segmentation on titles of a plurality of sample documents.
In a specific implementation, the document retrieval device may perform query matching from a preset inverted database based on each word in the target word string, so as to query a matching word string that matches each word in the target word string.
Step S202: concurrently querying, from a preset forward database and based on the matching word string, a matching document that matches the matching word string.
It should be noted that the preset forward database is a database for storing the sample documents, and the inverted index in the preset inverted database is obtained by segmenting the titles of these sample documents.
In a specific implementation, the document retrieval device may take the matching word strings found through the inverted index, query the preset forward database concurrently, and obtain from the preset forward database the matching documents that match the matching word strings.
Step S203: recalling the matching document, and taking the matching document as a candidate document matched with the target word string.
In a specific implementation, the document retrieval device may recall the queried matching document, and the recalled matching document may be a candidate document matching the target word string. In the recall process, the preset forward database can be queried concurrently to improve the query efficiency.
In this embodiment, a matching word string that matches each word segment in the target word string is queried from the preset inverted database; based on the matching word string, matching documents are concurrently queried from the preset forward database; and the matching documents are recalled and taken as the candidate documents matched with the target word string. Because document recall combines the inverted index with the forward index, the document search efficiency is further improved.
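Steps S201 to S203 can be sketched as follows. This is a minimal illustration only: the two dictionaries are hypothetical in-memory stand-ins for the preset inverted database and the preset forward database, and a thread pool stands in for whatever concurrency mechanism the device actually uses.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the preset inverted and forward databases.
INVERTED_INDEX = {
    "5G": ["doc1", "doc2"],
    "development strategy": ["doc1"],
    "path": ["doc3"],
}
FORWARD_STORE = {
    "doc1": {"title": "5G development strategy path"},
    "doc2": {"title": "5G network overview"},
    "doc3": {"title": "Career path planning"},
}

def fetch_document(doc_id):
    """Look up one document in the forward store (a database call in practice)."""
    return doc_id, FORWARD_STORE[doc_id]

def recall_candidates(target_word_string, max_workers=4):
    # Step S201: collect ids of documents whose index terms match a word segment.
    doc_ids = []
    for segment in target_word_string:
        for doc_id in INVERTED_INDEX.get(segment, []):
            if doc_id not in doc_ids:
                doc_ids.append(doc_id)
    # Step S202: query the forward store concurrently, one lookup per document id.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch_document, doc_ids))
    # Step S203: the fetched documents are the recalled candidate documents.
    return dict(results)

candidates = recall_candidates(["5G", "development strategy", "path"])
print(list(candidates))
```

`pool.map` returns results in submission order, so the candidates keep the order in which the inverted index produced them.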
Further, in this embodiment, before the step of obtaining the candidate document matched with the target word string, the method further includes:
Step S21: acquiring a sample document, and selecting a parsing strategy corresponding to the format of the sample document to parse it, so as to obtain a parsed document in a uniform format.
It should be noted that the styles and positions of titles and subtitles differ between document formats. For example, in pdf and word documents a subtitle may be located at the beginning of each chapter or paragraph, whereas in a ppt a subtitle may be located at the title position of the first page of a chapter. For each document format, a corresponding parsing strategy therefore needs to be selected to parse the sample document into a parsed document in a uniform format, which facilitates the subsequent construction of the preset inverted database and the preset forward database.
In a specific implementation, the document retrieval device can convert the sample document into an html format file. For a pdf document, the pdf is cut page by page; if a paragraph or table is split incorrectly by the cut, the split parts are spliced back together; the text styles and paragraph formats are then identified, and a third-party tool of the pdf converter type can be selected to convert the document into html format. For an office document, since the underlying file is a markup structure similar to the markdown format that contains multiple types of tags and accurate format information, a third-party tool of the LibreOffice type can be selected to convert the document into an html format file. The html format file is then converted into json structured data in a unified format; this json structured data can serve as the parsed document, and its storage fields contain information such as the title, subtitles, paragraphs and tables.
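As a rough illustration of the format-dependent strategy selection and the uniform json output, the following Python sketch may help. The class `ParsedDocument`, its field names, the function names and the list of file extensions are all assumptions made for this example; the actual pdf and office conversion tools are not modeled.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class ParsedDocument:
    """Uniform parsed form; the field names are illustrative assumptions."""
    title: str
    subtitles: list = field(default_factory=list)
    paragraphs: list = field(default_factory=list)
    tables: list = field(default_factory=list)

def choose_parsing_strategy(filename):
    """Pick a parsing strategy from the file extension (extension list is assumed)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".pdf":
        return "pdf"      # page-wise cutting, then pdf-to-html conversion
    if suffix in {".doc", ".docx", ".ppt", ".pptx"}:
        return "office"   # office-to-html conversion
    raise ValueError(f"unsupported format: {suffix}")

def to_json(document):
    """Serialize the parsed document as json structured data."""
    return json.dumps(asdict(document), ensure_ascii=False)

parsed = ParsedDocument(
    title="5G development strategy path",
    subtitles=["Background"],
    paragraphs=["Hypothetical body text."],
)
print(choose_parsing_strategy("report.pdf"), to_json(parsed))
```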
Step S22: segmenting the title of the parsed document with multi-granularity word segmentation to obtain an inverted index.
In a specific implementation, the document retrieval device can segment the title of the parsed document with a word segmentation granularity that combines coarse granularity and fine granularity, so that the resulting inverted index points to each document through a richer set of terms. For example, if the title of the indexed document is "5G development strategy path", the fine-grained word segments may be "5G", "development", "strategy" and "path", and the coarse-grained word segments may be "5G development" and "development strategy". In the prior art, the frequency with which each word occurs in the document, namely the word frequency, must be counted after word segmentation and is used to reflect the importance of the word in the title. In this embodiment, the semantic importance of each word segment is later used to optimize the relevance and already reflects the importance of each word in the title, so the word frequency need not be counted here, which reduces the performance overhead.
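A minimal sketch of multi-granularity title segmentation follows. It assumes that whitespace splitting stands in for a real word segmenter and that coarse-grained segments come from a hypothetical phrase dictionary, `COARSE_PHRASES`; neither assumption is stated in the source.

```python
# Hypothetical dictionary of coarse-grained phrases recognized by the segmenter.
COARSE_PHRASES = {"5G development", "development strategy"}

def multi_granularity_terms(title):
    """Return (fine, coarse) word segments for a title."""
    fine = title.split()  # whitespace split stands in for a real word segmenter
    coarse = []
    for i in range(len(fine) - 1):
        phrase = f"{fine[i]} {fine[i + 1]}"
        if phrase in COARSE_PHRASES:
            coarse.append(phrase)
    return fine, coarse

def build_inverted_index(titles):
    """Map every fine and coarse segment to the ids of the documents containing it."""
    index = {}
    for doc_id, title in titles.items():
        fine, coarse = multi_granularity_terms(title)
        for term in fine + coarse:
            index.setdefault(term, set()).add(doc_id)
    return index

fine, coarse = multi_granularity_terms("5G development strategy path")
index = build_inverted_index({"doc1": "5G development strategy path"})
print(fine, coarse)
```

No word frequency is recorded in the index, which matches the point that the semantic importance is used later in place of frequency counts.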
Step S23: constructing a preset inverted database based on the inverted index, and constructing a preset forward database according to the target sample documents corresponding to the inverted index.
In a specific implementation, the document retrieval device may store the obtained inverted indexes in the same database to construct the preset inverted database. When the preset inverted database is constructed, separate inverted index fields can be established for contents such as the title, the subtitles and the body text of each document, and different weights are given to different inverted index fields, where the weights can be the original word importance. When a search hits an inverted index field with richer semantic meaning, such as a title or a subtitle, the matched word obtains a higher weight, which further optimizes the search result.
It should be noted that the sample documents corresponding to the inverted index may be stored in the same database to construct the preset forward database. When each document file is stored in the preset forward database, the document retrieval device can traverse each paragraph in the blocks field of the document file (the blocks field stores results such as the full-text word segmentation information and the paragraph classification) to acquire the paragraph information. If the paragraph is a title or a subtitle, the word segmentation information and the word importance of each word in it are stored; if the paragraph is body text, a characterization of the word segmentation information is stored. The word segmentation information can be the information formed by the word segments obtained in the word segmentation process. After all paragraphs have been processed, the stored result is converted into a binary character string by an encryption algorithm, which further compresses the data, encrypts it, and thereby protects data privacy. Because the preset forward database is queried concurrently during inverted-index document recall, it must handle highly concurrent requests that each fetch a small amount of data; storing the sample documents in a mongo database to construct the preset forward database addresses this problem and further improves the overall efficiency of the device.
It can be understood that the preset inverted database and the preset forward database can be both constructed when the document retrieval device is in an offline state, and the document retrieval function can be executed in response to the retrieval entry input by the user when the document retrieval device is in an online state.
It should be noted that recall of sample documents from the preset forward database is accelerated in two ways. First, at the level of the underlying interaction protocol, long connections are used in place of short connections, so the three-way handshake and four-way close of the connection underlying the http protocol need not be performed for every request, which reduces the time consumed in the recall stage. Second, at the level of the caching mechanism, an inverted index preloading mechanism is adopted, applied mainly to the termvector and the filter of the inverted index: the termvector stores the document index corresponding to each word, and the filter holds other screening information that may accompany a query, such as whether the document creation time falls within a given interval or whether the document contains a subtitle. Once the termvector and filter information has been preloaded into memory, subsequent queries replace the originally I/O-intensive and inefficient interaction with the underlying disk by a highly efficient in-memory lookup, which greatly improves efficiency for terms that are not queried frequently. Using long connections instead of short connections at the protocol level, together with the memory preloading mechanism, thus effectively improves recall efficiency.
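The preloading idea can be sketched in Python as follows. The class `PreloadedIndex`, the json file layout and the filter attribute names (`created`, `has_subtitle`) are assumptions for illustration; the source names only the termvector and filter concepts, and the long-connection part is not modeled here.

```python
import json
import os
import tempfile

class PreloadedIndex:
    """Loads termvector and filter data from disk once, then answers from memory."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        # termvector: word -> ids of the documents containing that word
        self.term_vector = {term: set(ids) for term, ids in data["termvector"].items()}
        # filter: document id -> screening attributes
        self.filters = data["filter"]

    def query(self, term, created_between=None, require_subtitle=False):
        hits = []
        for doc_id in sorted(self.term_vector.get(term, set())):
            info = self.filters[doc_id]
            if created_between and not (created_between[0] <= info["created"] <= created_between[1]):
                continue
            if require_subtitle and not info["has_subtitle"]:
                continue
            hits.append(doc_id)
        return hits

# Hypothetical on-disk index, written to a temporary file for this demonstration.
index_data = {
    "termvector": {"5G": ["doc1", "doc2"], "path": ["doc1"]},
    "filter": {
        "doc1": {"created": "2023-05-01", "has_subtitle": True},
        "doc2": {"created": "2021-01-15", "has_subtitle": False},
    },
}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    json.dump(index_data, tmp)
index = PreloadedIndex(tmp.name)  # the only disk read
os.remove(tmp.name)               # later queries no longer touch the file

recent = index.query("5G", created_between=("2023-01-01", "2023-12-31"))
print(recent)
```

Dates are kept as ISO-8601 strings so that plain string comparison orders them correctly.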
For ease of understanding, the description is given with reference to fig. 5, but the present solution is not limited thereto. Fig. 5 is a schematic diagram of the overall file retrieval flow in the third embodiment of the document retrieval method of the present invention. In fig. 5, under the offline flow, a sample document is obtained and its type is determined, where a first-type sample document may be an office document and a second-type sample document may be a pdf document; the sample document is parsed with the parsing strategy corresponding to its type, and the preset inverted database and the preset forward database are constructed from the parsed documents. Under the online flow, the search term input by the user is obtained and analyzed to obtain the target word string and the semantic importance of each word segment in it; documents are recalled based on the target word string by querying the preset inverted database and then fetching the matched documents from the preset forward database; the recalled documents are sorted according to the relevance optimized by the semantic importance; and finally the sorted candidate documents are displayed to the user.
Further, in this embodiment, the constructing the online ranking model includes:
step S01: initial data formed by a preset document is obtained, and the initial data is marked according to preset industry keywords in the preset document, so that marked data are obtained.
It should be noted that, the preset document may be a high-precision scene related document, and a part of the document carries preset industry keywords. The preset industry keywords may be keywords that are labeled to facilitate model efficiency.
In a specific implementation, the document retrieval device may obtain the initial data composed of the preset documents and determine, for each preset document, whether a preset industry keyword can be extracted from it. If a preset industry keyword can be extracted, the keyword is extracted and the document is labeled 1; otherwise the document is labeled 0. When labeling is completed, all labeled documents are collected to obtain the labeled data.
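The labeling rule of step S01 can be sketched as follows. The keyword list, the sample documents and the case-insensitive substring match are assumptions for illustration; the source does not specify how keywords are extracted.

```python
# Hypothetical preset industry keywords.
INDUSTRY_KEYWORDS = ["5G", "edge computing"]

def extract_keywords(text, keywords):
    """Return the preset keywords found in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [keyword for keyword in keywords if keyword.lower() in lowered]

def label_documents(documents, keywords):
    """Label a document 1 if any preset keyword can be extracted, otherwise 0."""
    labeled = []
    for text in documents:
        found = extract_keywords(text, keywords)
        labeled.append({"text": text, "keywords": found, "label": 1 if found else 0})
    return labeled

labeled_data = label_documents(
    ["5G development strategy path", "Annual staff party schedule"],
    INDUSTRY_KEYWORDS,
)
print([row["label"] for row in labeled_data])
```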
Step S02: training a transformer model with the labeled data to obtain an offline sorting model, wherein the output of the offline sorting model is a single-layer result obtained by combining the outputs of the last two layers of the transformer model.
It should be noted that the transformer model may be a deep learning model that uses a self-attention mechanism and assigns different weights according to the importance of each part of the input data.
In a specific implementation, the labeled data can be divided into a training set and a validation set; the transformer model is trained on the training set and adjusted according to its accuracy on the validation set to obtain the offline sorting model.
It should be understood that, in the above-mentioned transformer model, the transformer structures of different layers focus on different semantic contents, the transformer structures closer to the input layer tend to extract shallow semantic features such as lexical, syntactic, sequential relationships, etc., while the transformer structures farther from the input layer tend to express deep semantic meaning in the input data, and the transformer model of this embodiment focuses on deep semantic understanding because the requirement for deep semantic understanding is high for document retrieval.
For ease of understanding, the description is given with reference to fig. 6, but the present solution is not limited thereto. Fig. 6 is a schematic diagram of the output of the offline sorting model in the third embodiment of the document retrieval method of the present invention. In fig. 6, transformer layer 1 is the first layer of the transformer structure, transformer layer 2 is the second layer, and so on up to transformer layer N-1 and transformer layer N. Input data first enters transformer layer 1 and, after being processed, is passed to the next layer in sequence until it reaches the last layer, transformer layer N, whose processing yields the semantic representation. To optimize the offline sorting model and improve its accuracy, the outputs of the last two layers, namely transformer layer N-1 and transformer layer N, can be combined: specifically, the two outputs are merged, pooled, and mapped to the output dimension of a single transformer layer before being output.
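One possible reading of "merged, pooled, and mapped to the output dimension of a single layer" is sketched below in plain Python. Concatenating along the feature axis, mean pooling over tokens and a linear projection are assumptions chosen for this illustration; the source does not fix the merge, pooling or mapping operations, and the projection weights here are hand-picked rather than learned.

```python
def combine_last_two_layers(layer_n_minus_1, layer_n, projection):
    """Merge two layer outputs (tokens x hidden) into one vector of size hidden.

    projection is a hidden x (2 * hidden) weight matrix (learned in practice).
    """
    # Merge: concatenate the two hidden vectors of each token.
    merged = [a + b for a, b in zip(layer_n_minus_1, layer_n)]
    # Pool: average each feature over all tokens.
    n_tokens = len(merged)
    pooled = [sum(token[j] for token in merged) / n_tokens for j in range(len(merged[0]))]
    # Map: project back to the output dimension of a single layer.
    return [sum(w * x for w, x in zip(row, pooled)) for row in projection]

# Toy outputs for 2 tokens with hidden size 2.
layer_n_minus_1 = [[1.0, 2.0], [3.0, 4.0]]
layer_n = [[0.0, 2.0], [2.0, 0.0]]
# Hand-picked projection that averages the matching features of both layers.
projection = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]

combined = combine_last_two_layers(layer_n_minus_1, layer_n, projection)
print(combined)
```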
Step S03: predicting preset unlabeled document data with the offline sorting model to obtain a sample correlation result of the preset unlabeled document data.
It should be noted that, the preset unlabeled document data may be data formed by randomly selected unlabeled documents.
In a specific implementation, the preset unlabeled document data can be input into the offline sorting model, the offline sorting model predicts, the sample correlation result of the preset unlabeled document data is output, and the sample correlation result can be used as the input of a subsequent model.
Step S04: training an XGBoost model according to the sample correlation result to obtain the online ranking model.
It should be noted that the XGBoost (eXtreme Gradient Boosting) model may be an ensemble machine learning model based on decision trees.
In a specific implementation, the structure of the offline sorting model is optimized mainly at the level of semantic expression, so it performs well in scenarios that depend on deep semantic representation, such as judging semantic similarity. The online process, by contrast, has high performance requirements and can rarely rely on a GPU accelerator card, so the online ranking model cannot be a heavy model with a parameter count of the same magnitude as the offline sorting model, and an XGBoost model can be adopted instead. Because the offline sorting model performs well on deep semantic representation, the sample correlation results it outputs can be input into the XGBoost model for training, either alone or together with other sample data, to obtain the online ranking model and improve its accuracy.
For ease of understanding, the description is given with reference to fig. 7, but the present solution is not limited thereto. Fig. 7 is a schematic diagram of training the online ranking model in the third embodiment of the document retrieval method of the present invention. In fig. 7, in the offline sorting model training stage, the transformer model is trained with a small amount of labeled data to obtain the offline sorting model; in the offline sorting model prediction stage, the offline sorting model predicts a large amount of unlabeled document data; and in the online ranking model training stage, the XGBoost model is trained with the prediction results to obtain the online ranking model. Online, the user inputs a search term, the online ranking model performs the document search operation, and the ranked candidate documents are output and displayed to the user.
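The three-stage flow of fig. 7 (a heavy model scores unlabeled data, and a light model is trained on those scores) can be sketched as follows. Both models are deliberately replaced by stand-ins: `teacher_score` is a simple term-overlap function in place of the transformer-based offline sorting model, and `fit_stump` is a one-split regression tree in place of XGBoost. The feature set and the sample titles are hypothetical.

```python
def teacher_score(query_terms, title):
    """Stand-in for the offline sorting model: fraction of query terms found in the title."""
    words = set(title.lower().split())
    return sum(term.lower() in words for term in query_terms) / len(query_terms)

def features(query_terms, title):
    """Cheap features available online: term overlap count and title length."""
    words = title.lower().split()
    return [sum(term.lower() in words for term in query_terms), len(words)]

def fit_stump(rows, targets, feature_index=0):
    """Stand-in for the online model: a one-split regression tree on one feature."""
    best = None
    for threshold in sorted({row[feature_index] for row in rows}):
        left = [t for row, t in zip(rows, targets) if row[feature_index] <= threshold]
        right = [t for row, t in zip(rows, targets) if row[feature_index] > threshold]
        if not left or not right:
            continue
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # no usable split: predict the overall mean
        overall = sum(targets) / len(targets)
        return lambda row: overall
    _, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[feature_index] <= threshold else right_mean

query = ["5G", "strategy"]
unlabeled_titles = [
    "5G development strategy path",
    "5G network overview",
    "career path planning",
    "strategy for 5G rollout",
]
# Prediction stage: the teacher scores the unlabeled data.
pseudo_labels = [teacher_score(query, title) for title in unlabeled_titles]
# Training stage: the light student is fitted on the teacher's scores.
student = fit_stump([features(query, title) for title in unlabeled_titles], pseudo_labels)
print(student(features(query, "5G strategy")), student(features(query, "holiday notice")))
```

The student never sees human labels in this sketch; it only imitates the teacher's scores from features that are cheap to compute online.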
In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a document retrieval program, and the document retrieval program realizes the steps of the document retrieval method when being executed by a processor.
Referring to fig. 8, fig. 8 is a block diagram showing the structure of a first embodiment of the document retrieving apparatus of the present invention.
As shown in fig. 8, a document retrieval apparatus according to an embodiment of the present invention includes:
The semantic importance module 501 is configured to obtain a target word string of a search term input by a user, and determine the semantic importance of the target word string.
The document relevance module 502 is configured to obtain a candidate document that matches the target word string, and determine a relevance between the target word string and the candidate document.
The correlation optimization module 503 is configured to optimize the correlation according to the semantic importance, and obtain an optimized correlation.
The candidate document display module 504 is configured to sort the candidate documents according to the optimized relevance and display them.
In this embodiment, the target word string of the search term input by the user is obtained, and the semantic importance of the target word string is determined; candidate documents matched with the target word string are then obtained, and the correlation degree between the target word string and the candidate documents is determined; finally, the correlation is optimized according to the semantic importance to obtain the optimized correlation, and the candidate documents are sorted and displayed according to the optimized relevance. Because the semantic importance of the target word string is determined and used to optimize the correlation between the target word string and the candidate documents, a candidate document that merely matches a word of low importance is no longer displayed prominently just because its correlation is high, so the search results better meet the user's needs and the user experience is effectively improved.
Based on the above-described first embodiment of the document retrieval device of the present invention, a second embodiment of the document retrieval device of the present invention is proposed.
In this embodiment, the semantic importance module 501 is further configured to obtain the original word importance of each word segment in the target word string relative to the search term; determine the segmented sentence length of the search term according to the number of word segments; and optimize the original word importance based on the segmented sentence length to obtain the semantic importance of the target word string.
As an implementation manner, the semantic importance module 501 is further configured to perform unified processing on the importance of the original word based on the sentence length after word segmentation by using a preset word importance optimization formula to obtain the semantic importance of the target word string, where the preset word importance optimization formula is:
W=Important(Sent)*len(Sent),
wherein W is the unified word importance, Important(Sent) is the original word importance, and len(Sent) is the sentence length after word segmentation.
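As a small worked illustration of the word importance optimization formula, the sketch below multiplies each original word importance by the segmented sentence length. The original importance values are hypothetical and are assumed to sum to 1 over the word segments, in which case the unified values average to 1 regardless of how many segments the search term has.

```python
def unified_importance(original_importance):
    """Apply W = Important(Sent) * len(Sent) to every word segment."""
    sentence_length = len(original_importance)  # number of word segments
    return {segment: value * sentence_length for segment, value in original_importance.items()}

# Hypothetical original word importance values for three word segments.
original = {"5G": 0.5, "development strategy": 0.3, "path": 0.2}
unified = unified_importance(original)
average = sum(unified.values()) / len(unified)
print(unified, average)
```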
As an implementation manner, the relevance optimization module 503 is further configured to optimize the relevance based on the semantic importance by using a preset relevance optimization formula to obtain an optimized word segmentation relevance of each word segmentation, where the preset relevance optimization formula is:
S_tw = W_i * S_i,
wherein S_tw is the optimized word segmentation relevance, W_i is the unified word importance of word segment i in the target word string, and S_i is the relevance between word segment i in the target word string and the candidate document;
superposing the optimized word segmentation relevance of each word segment through a preset relevance superposition formula to obtain the optimized relevance of the target word string, wherein the preset relevance superposition formula is as follows:
R_doc = Σ S_tw,
wherein R_doc is the optimized relevance, S_tw is the optimized word segmentation relevance, and the sum is taken over the word segments in the target word string.
Based on the above-described embodiments of the document retrieval device of the present invention, a third embodiment of the document retrieval device of the present invention is proposed.
In this embodiment, the document relevance module 502 is further configured to query, from a preset inverted database, a matching word string that matches each word segment in the target word string; concurrently query, from a preset forward database and based on the matching word string, matching documents that match the matching word string; and recall the matching documents and take them as the candidate documents matched with the target word string.
As an implementation manner, the document correlation module 502 is further configured to obtain a sample document, and select a corresponding parsing policy according to a format of the sample document to parse the sample document, so as to obtain a parsed document with uniform format; performing word segmentation on the title of the parsed document by adopting multi-granularity word segmentation granularity to obtain an inverted index; and constructing a preset inverted database based on the inverted index, and constructing a preset forward database according to the target sample document corresponding to the inverted index.
As one embodiment, the constructing of the online ranking model includes: acquiring initial data formed by preset documents, and labeling the initial data according to preset industry keywords in the preset documents to obtain labeled data; training a transformer model with the labeled data to obtain an offline sorting model, wherein the output of the offline sorting model is a single-layer result obtained by combining the outputs of the last two layers of the transformer model; predicting preset unlabeled document data with the offline sorting model to obtain a sample correlation result of the preset unlabeled document data; and training an XGBoost model according to the sample correlation result to obtain the online ranking model.
The specific implementation manner of the document retrieval device of the present invention may refer to the above method embodiments, and will not be described herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. read-only memory/random-access memory, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. A document retrieval method, wherein the document retrieval method is applied to an online ranking model, the method comprising the steps of:
acquiring a target word string of a search term input by a user, and determining the semantic importance of the target word string;
acquiring a candidate document matched with the target word string, and determining the correlation degree between the target word string and the candidate document;
optimizing the correlation according to the semantic importance to obtain an optimized correlation;
and sequencing and displaying the candidate documents according to the optimized relevance.
2. The document retrieval method of claim 1, wherein the step of determining the semantic importance of the target word string comprises:
acquiring the original word importance of each word segment in the target word string relative to the search term;
determining the sentence length after word segmentation of the search term according to the number of each word segment;
optimizing the importance of the original words based on the sentence length after word segmentation to obtain the semantic importance of the target word string.
3. The document retrieval method according to claim 2, wherein the step of optimizing the importance of the original word based on the segmented sentence length to obtain the semantic importance of the target word string includes:
The original word importance is subjected to unified processing based on the sentence length after word segmentation through a preset word importance optimization formula to obtain the semantic importance of the target word string, wherein the preset word importance optimization formula is as follows:
W=Important(Sent)*len(Sent),
wherein W is the unified word importance, Important(Sent) is the original word importance, and len(Sent) is the sentence length after word segmentation.
4. The document retrieval method according to claim 3, wherein the step of optimizing the degree of correlation according to the semantic importance to obtain an optimized degree of correlation comprises:
optimizing the relevance based on the semantic importance through a preset relevance optimization formula to obtain optimized word segmentation relevance of each word segmentation, wherein the preset relevance optimization formula is as follows:
S_tw = W_i * S_i,
wherein S_tw is the optimized word segmentation relevance, W_i is the unified word importance of word segment i in the target word string, and S_i is the relevance between word segment i in the target word string and the candidate document;
superposing the optimized word segmentation relevance of each word segment through a preset relevance superposition formula to obtain the optimized relevance of the target word string, wherein the preset relevance superposition formula is as follows:
R_doc = Σ S_tw,
wherein R_doc is the optimized relevance, S_tw is the optimized word segmentation relevance, and the sum is taken over the word segments in the target word string.
5. The document retrieval method according to claim 1, wherein the step of obtaining a candidate document matching the target word string includes:
inquiring a matched word string matched with each word in the target word string from a preset inverted database;
based on the matching word strings, concurrently querying matching documents matched with the matching word strings from a preset forward database;
and recalling the matching document, and taking the matching document as a candidate document matched with the target word string.
6. The document retrieval method according to claim 5, wherein before the step of obtaining the candidate document matching the target word string, further comprising:
acquiring a sample document, and selecting a corresponding analysis strategy according to the format of the sample document to analyze the sample document to obtain an analyzed document with uniform format;
performing word segmentation on the title of the parsed document by adopting multi-granularity word segmentation granularity to obtain an inverted index;
and constructing a preset inverted database based on the inverted index, and constructing a preset forward database according to the target sample document corresponding to the inverted index.
7. The document retrieval method according to any one of claims 1 to 6, wherein the constructing of the online ranking model includes:
acquiring initial data formed by a preset document, and marking the initial data according to preset industry keywords in the preset document to obtain marking data;
training a transformer model with the labeled data to obtain an offline sorting model, wherein the output of the offline sorting model is a single-layer result obtained by combining the outputs of the last two layers of the transformer model;
predicting preset unlabeled document data based on the offline sorting model to obtain a sample correlation result of the preset unlabeled document data;
and training an XGBoost model according to the sample correlation result to obtain the online ranking model.
8. A document retrieval apparatus, the apparatus comprising:
a semantic importance module, configured to acquire a target word string of a search term input by a user and determine the semantic importance of the target word string;
a document correlation module, configured to acquire candidate documents matching the target word string and determine the correlation between the target word string and the candidate documents;
a correlation optimization module, configured to optimize the correlation according to the semantic importance to obtain an optimized correlation;
and a candidate document display module, configured to rank the candidate documents according to the optimized correlation and display the ranked candidate documents.
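The four modules of claim 8 form a pipeline that can be sketched in a few lines. The formulas below are assumptions chosen for illustration, since this claim does not fix them: semantic importance is taken to be a smoothed inverse document frequency normalised to sum to 1, the per-term correlation is a simple 0/1 title match, and the optimized correlation is the importance-weighted sum of the per-term correlations. All function names and the sample data are hypothetical.

```python
import math


def semantic_importance(terms, doc_freq, n_docs):
    # Assumed importance: smoothed inverse document frequency, normalised to sum to 1.
    raw = {t: math.log((n_docs + 1) / (doc_freq.get(t, 0) + 1)) + 1.0 for t in terms}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def term_correlation(term, title):
    # Assumed per-term correlation: 1.0 if the term occurs in the title, else 0.0.
    return 1.0 if term in title.lower().split() else 0.0


def rank_candidates(terms, candidates, weights):
    # Optimized correlation: importance-weighted sum of the per-term correlations.
    scored = [
        (doc_id, sum(weights[t] * term_correlation(t, title) for t in terms))
        for doc_id, title in candidates.items()
    ]
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


terms = ["5g", "network", "guide"]                # target word string
candidates = {
    "a": "5G network overview",
    "b": "network deployment guide",
    "c": "campus network guide",
}
doc_freq = {"5g": 1, "network": 3, "guide": 2}    # document counts per term
weights = semantic_importance(terms, doc_freq, n_docs=len(candidates))
ranked = rank_candidates(terms, candidates, weights)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Each candidate here matches exactly two of the three terms, so an unweighted count would tie all three documents. Weighting by importance places the document containing the rarer term "5g" first.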
9. A document retrieval apparatus, the apparatus comprising: a memory, a processor, and a document retrieval program stored on the memory and executable on the processor, the document retrieval program configured to implement the steps of the document retrieval method of any one of claims 1 to 7.
10. A storage medium having stored thereon a document retrieval program which, when executed by a processor, implements the steps of the document retrieval method according to any one of claims 1 to 7.
CN202310827625.4A 2023-07-06 2023-07-06 Document retrieval method, device, equipment and storage medium Pending CN116975202A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310827625.4A CN116975202A (en) 2023-07-06 2023-07-06 Document retrieval method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310827625.4A CN116975202A (en) 2023-07-06 2023-07-06 Document retrieval method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116975202A true CN116975202A (en) 2023-10-31

Family

ID=88478868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310827625.4A Pending CN116975202A (en) 2023-07-06 2023-07-06 Document retrieval method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116975202A (en)

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN110674429B (en) Method, apparatus, device and computer readable storage medium for information retrieval
US8073877B2 (en) Scalable semi-structured named entity detection
US9390161B2 (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
US20100094835A1 (en) Automatic query concepts identification and drifting for web search
US20070136280A1 (en) Factoid-based searching
CN111984851B (en) Medical data searching method, device, electronic device and storage medium
JP7451747B2 (en) Methods, devices, equipment and computer readable storage media for searching content
CN110990533B (en) Method and device for determining standard text corresponding to query text
US20210103622A1 (en) Information search method, device, apparatus and computer-readable medium
CN110147494B (en) Information searching method and device, storage medium and electronic equipment
CN108875065B (en) Indonesia news webpage recommendation method based on content
CN111160007B (en) Search method and device based on BERT language model, computer equipment and storage medium
CN115563515B (en) Text similarity detection method, device, equipment and storage medium
JP5179564B2 (en) Query segment position determination device
JP5315726B2 (en) Information providing method, information providing apparatus, and information providing program
CN111859066B (en) Query recommendation method and device for operation and maintenance work order
CN116975202A (en) Document retrieval method, device, equipment and storage medium
CN111368036B (en) Method and device for searching information
CN111159526B (en) Query statement processing method, device, equipment and storage medium
JP2010282403A (en) Document retrieval method
CN110851560A (en) Information retrieval method, device and equipment
CN113806510B (en) Legal provision retrieval method, terminal equipment and computer storage medium
CN116992874B (en) Text quotation auditing and tracing method, system, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination