CN118093809A - Document searching method and device and electronic equipment - Google Patents


Info

Publication number
CN118093809A
CN118093809A
Authority
CN
China
Prior art keywords
segment, target, document, fragment, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410095887.0A
Other languages
Chinese (zh)
Inventor
李伟光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Motor Co Ltd
Original Assignee
Great Wall Motor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Great Wall Motor Co Ltd filed Critical Great Wall Motor Co Ltd
Priority to CN202410095887.0A
Publication of CN118093809A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a document searching method, a document searching apparatus, and an electronic device. The method comprises: receiving a document search question input by a user, and determining a plurality of similar segments corresponding to the question from pre-stored documents; determining a first preset number of target segments from the plurality of similar segments; determining positioning information of each target segment within the document to which it belongs; analyzing the target segments to obtain segment results; and outputting the positioning information and the segment results. With the positioning information, the user can retrieve and view the position of a target segment in its document, which improves retrieval efficiency, and can also view the segments surrounding the target segment. The surrounding context lets the user further judge whether a retrieved target segment matches the question, that is, whether it is the result the user actually wants, and thereby assess the accuracy and reliability of the retrieved target segment.

Description

Document searching method and device and electronic equipment
Technical Field
The disclosure relates to the technical field of document searching, in particular to a document searching method, a document searching device and electronic equipment.
Background
Enterprises accumulate many internal documents as they develop, and these documents are searched to promote the circulation of knowledge and collaboration between departments. However, when searching enterprise documents, a technician typically identifies the relevant documents from experience. Documents identified this way may be inaccurate and fail to meet the user's needs, and the user cannot directly obtain the position of the relevant passage within a document, so retrieval efficiency is low.
In view of this, how to let the user obtain the position of the relevant passage in an enterprise document, and thereby improve retrieval efficiency, is a technical problem to be solved.
Disclosure of Invention
In view of the above, the present disclosure aims to provide a document searching method, a document searching apparatus, and an electronic device, so as to solve the prior-art problem that a user cannot directly obtain the position of the relevant passage in an enterprise document, which results in low retrieval efficiency.
Based on the above object, a first aspect of the present disclosure proposes a document searching method, the method comprising:
receiving a document search question input by a user, and determining a plurality of similar segments corresponding to the question from pre-stored documents;
determining a first preset number of target segments from the plurality of similar segments;
determining positioning information of each target segment within the document to which it belongs;
analyzing the target segments to obtain segment results; and
outputting the positioning information and the segment results.
Based on the same inventive concept, a second aspect of the present disclosure proposes a document searching apparatus comprising:
a similar segment determining module configured to receive a document search question input by a user and determine a plurality of similar segments corresponding to the question from pre-stored documents;
a target segment determining module configured to determine a first preset number of target segments from the plurality of similar segments;
a positioning information determining module configured to determine positioning information of each target segment within the document to which it belongs;
an analysis processing module configured to analyze the target segments to obtain segment results; and
an output module configured to output the positioning information and the segment results.
Based on the same inventive concept, a third aspect of the present disclosure proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method as described above when executing the computer program.
As can be seen from the above, in the document searching method, apparatus, and electronic device provided by the disclosure, a document search question input by a user is received, and a plurality of similar segments corresponding to the question are determined from pre-stored documents. A first preset number of target segments is then determined from the plurality of similar segments, so that the chosen target segments are those most relevant to the question, which improves retrieval accuracy. The positioning information of each target segment within its document is determined; with it, the user can retrieve and view the position of the target segment in the document, which improves retrieval efficiency, and can view the segments surrounding the target segment. The surrounding context lets the user further judge whether a retrieved target segment matches the question, that is, whether it is the result the user actually wants, and thereby assess the accuracy and reliability of the retrieved target segment. Each target segment is analyzed to obtain a segment result; because this result is related to the question, it can serve as the answer to the question, making it easy for the user to view the key information in the target segment. Finally, the positioning information and the segment results are output.
Drawings
In order to more clearly illustrate the technical solutions of the present disclosure or the related art, the drawings needed in the description of the embodiments or the related art are briefly introduced below. It is apparent that the drawings described below show only embodiments of the present disclosure, and that those of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1A is a flow chart of a document searching method of an embodiment of the present disclosure;
FIG. 1B is a schematic diagram of a document search output according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a document search output method of an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a document searching apparatus according to an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
For the purposes of promoting an understanding of the principles and advantages of the disclosure, reference will now be made to the embodiments illustrated in the drawings and specific language will be used to describe the same.
It should be noted that, unless otherwise defined, technical or scientific terms used in the embodiments of the present disclosure have the ordinary meaning understood by one of ordinary skill in the art to which the present disclosure pertains. The terms "first," "second," and the like do not denote any order, quantity, or importance, but merely distinguish one element from another. Words such as "comprising" or "comprises" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding others. Terms such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper," "lower," "left," "right," and the like indicate relative positional relationships only, which may change when the absolute position of the described object changes.
Following the background above, enterprise standard document retrieval is one form of document retrieval. Enterprises accumulate many knowledge files as they develop; researchers in specialized fields need to acquire production knowledge and experience quickly and accurately, and knowledge must circulate between departments. The technology center therefore raises a requirement for fast and accurate retrieval over in-house knowledge.
This embodiment mainly aims to locate, efficiently and accurately, the document that contains the content the user searched for, so that alongside the answer the user can consult the positioning information of the original text and judge the accuracy and reliability of the answer for themselves.
As described above, how to help the user obtain the position of the relevant passage in an enterprise document, so as to improve retrieval efficiency, has become an important research problem.
Based on the above description, as shown in fig. 1A, the document searching method provided in this embodiment includes:
Step 101: receive a document search question input by a user, and determine a plurality of similar segments corresponding to the question from pre-stored documents.
Specifically, the documents in this embodiment are documents stored in advance in a database. For example, they may be enterprise documents stored in an enterprise database, or literature stored in a literature library.
When a user queries documents, the user inputs the question to be searched, and the query is run over the documents based on that question. A question input box is provided in advance, and the user enters the question there.
After the user's question is received, a plurality of similar segments corresponding to it are determined from the pre-stored documents. For example, when searching an enterprise database, a user enters the question "enterprise standard" in the input box; the pre-stored enterprise documents are then searched and a plurality of segments related to "enterprise standard" are determined.
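The retrieval in step 101 amounts to a similarity search over the pre-stored segments. The patent does not fix a particular similarity measure, so the sketch below uses a toy bag-of-words cosine similarity as a stand-in for whatever encoder a real system would use; the function names (`embed`, `find_similar_segments`) are illustrative, not from the source:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; a real system would use a trained text encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_similar_segments(question, segments, k=5):
    # Score every stored segment against the question and keep the k best
    # segments with a non-zero similarity.
    q = embed(question)
    scored = [(cosine(q, embed(s)), s) for s in segments]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for score, s in scored if score > 0][:k]
```

In a production system the scoring would be done by a vector index rather than a linear scan, but the interface is the same: a question in, a ranked list of candidate segments out.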
Step 102: determine a first preset number of target segments from the plurality of similar segments.
In implementation, the target segments are the segments, among the similar segments, that are most relevant to the question. The first preset number is configured in advance, and the user may also customize it as needed.
For example, suppose there are 10 similar segments and the first preset number is 3. The 10 similar segments are sorted by their relevance to the question, and the 3 most relevant are taken as the target segments.
Step 103: determine the positioning information of each target segment within the document to which it belongs.
In implementation, the pre-stored documents include information about each segment, including the positioning information of each segment within its document, from which the positioning information of a target segment is obtained. Positioning information may be expressed as a paragraph number or as segment coordinates.
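One minimal way to represent the positioning information of step 103 (the source mentions paragraph numbers and segment coordinates) is a small record attached to each stored segment. The `locate_segment` helper below is an illustrative sketch under the assumption that paragraphs are separated by blank lines; the names are not from the source:

```python
from dataclasses import dataclass

@dataclass
class Positioning:
    document: str   # name of the document the segment belongs to
    paragraph: int  # 0-based paragraph index within the document
    start: int      # character offset where the segment begins
    end: int        # character offset where the segment ends

def locate_segment(document_text, segment, document_name):
    # Hypothetical helper: find the segment in the document text and
    # derive both paragraph-number and coordinate-style positioning.
    start = document_text.find(segment)
    if start == -1:
        return None
    paragraph = document_text.count("\n\n", 0, start)
    return Positioning(document_name, paragraph, start, start + len(segment))
```

In practice these records would be computed once when the documents are ingested and stored alongside the segments, so step 103 is a lookup rather than a search.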
Step 104: analyze the target segments to obtain segment results.
In implementation, a segment result is the information obtained by integrating a target segment and output to the user in the form of an answer.
The target segment is spliced with its preceding segment and/or its following segment to obtain a spliced segment, and the spliced segment is then subjected to extraction and summarization to obtain the segment result output to the user as an answer.
Step 105: output the positioning information and the segment results.
In particular, as shown in fig. 1B, fig. 1B is a schematic diagram of a document search output according to an embodiment of the present disclosure.
The user inputs a question in the question input box, the documents are queried, and the target segments related to the question are analyzed to obtain segment results in the form of an answer, which are output to the user so that the user can conveniently view the answer to the question.
In addition, each target segment is located, its positioning information within its document is determined, and the positioning information is output to the user together with the located target segment. The user can then view the original text corresponding to the target segment in the document and, based on that original text, judge whether the retrieved target segment answers the question, and thus whether it is accurate and reliable.
Through the above embodiment, a document search question input by a user is received, and a plurality of similar segments corresponding to the question are determined from pre-stored documents. A first preset number of target segments is determined from the plurality of similar segments, so that the chosen target segments are those most relevant to the question, which improves retrieval accuracy. The positioning information of each target segment within its document is determined; with it, the user can retrieve and view the position of the target segment in its document, which improves retrieval efficiency, and can view the segments surrounding the target segment, so as to further judge whether the retrieved target segment matches the question, that is, whether it is the result the user actually wants, and thereby assess its accuracy and reliability. Each target segment is analyzed to obtain a segment result; because this result is related to the question, it can serve as the answer to the question, making it easy for the user to view the key information in the target segment. Finally, the positioning information and the segment results are output.
In some embodiments, the method further comprises: pre-training a neural network model for segment ranking.
Step 102 comprises:
Step 1021: input the plurality of similar segments into the neural network model, and rank them by their relevance to the question using the model, obtaining a plurality of ranked similar segments.
Step 1022: determine the first preset number of target segments from the ranked similar segments.
In implementation, the neural network model is obtained by pre-training and can rank segments by their relevance to the question. It may be a large model, that is, a neural network model with a very large number of parameters. Large models can typically learn many different natural language processing (NLP) tasks together, such as machine translation, text summarization, and question answering, which gives them broad, generalized language understanding. Depending on the scenario, large models fall into four major categories: large language models, computer vision models (covering images and video), audio models, and multimodal models. This embodiment preferably uses a large language model.
The similar segments retrieved by the query are input into the large model, which ranks them by relevance to the question; the first preset number of most relevant target segments is then taken from the ranked list.
For example, 10 similar segments are input into the large model, which ranks them by relevance to the question from highest to lowest: first similar segment, second similar segment, third similar segment, fourth similar segment, fifth similar segment, ..., tenth similar segment. When the first preset number is 3, the first, second, and third similar segments are taken as the target segments.
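Steps 1021 and 1022 together amount to sorting the candidates by a model-produced relevance score and keeping the top N. The sketch below abstracts the pre-trained ranking model behind a `score_fn` callable; that callable is a hypothetical stand-in (the toy score used below is simple word overlap, not the model the patent describes):

```python
def rerank_segments(question, similar_segments, score_fn, first_preset_number=3):
    # score_fn(question, segment) -> relevance score; stands in for the
    # pre-trained neural ranking model described in the embodiment.
    ordered = sorted(similar_segments,
                     key=lambda seg: score_fn(question, seg),
                     reverse=True)
    # Keep the first preset number of most relevant segments.
    return ordered[:first_preset_number]
```

Because `sorted` is stable, candidates with equal model scores keep their retrieval order, so the initial similarity search still acts as a tie-breaker.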
With this scheme, because the large model has broad, generalized language understanding, it can accurately select from the similar segments the target segments most relevant to the question.
In some embodiments, the method further comprises: pre-training a neural network model for segment extraction and summarization.
Step 104 comprises:
Step 1041: splice the target segment with its preceding segment and/or its following segment to obtain a spliced segment.
In implementation, the preceding segment and/or following segment corresponding to the target segment is obtained, and the target segment is spliced with it to obtain the spliced segment.
Specifically, the target segment may be spliced with its preceding segment, or with its following segment, or with both, to obtain the spliced segment.
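The three splicing variants just described (preceding only, following only, or both) can be captured in one small helper. The newline separator is an assumption; the source does not specify how the pieces are joined:

```python
def splice(target, preceding=None, following=None):
    # Join the target segment with whichever context segments are present,
    # keeping document order: preceding, target, following.
    parts = [p for p in (preceding, target, following) if p]
    return "\n".join(parts)
```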
Step 1042: extract and summarize the spliced segment through the neural network model to obtain the segment result.
In implementation, the neural network model is obtained by pre-training and can extract and summarize segments. It may be a large model, that is, a neural network model with a very large number of parameters. Large models can typically learn many different natural language processing (NLP) tasks together, such as machine translation, text summarization, and question answering, which gives them broad, generalized language understanding. Depending on the scenario, large models fall into four major categories: large language models, computer vision models (covering images and video), audio models, and multimodal models. This embodiment preferably uses a large language model.
A segment result is the information obtained by integrating a target segment and output to the user in the form of an answer. The large model extracts keywords from the spliced segment to obtain target keywords related to the question, and then summarizes the target keywords to obtain the segment result.
With this scheme, splicing the target segment with its preceding and/or following segment makes the spliced segment more complete, more comprehensive, and easier to understand; the user can check the target segment and judge whether it is accurate and reliable, and incomplete target segments are not output to the user. In addition, because the large model has broad, generalized language understanding, it can accurately extract the target keywords related to the question from the spliced segment and summarize them into a segment result in the form of an answer. The user obtains the answer to the question directly from the segment result, without having to distill it themselves, which improves retrieval efficiency.
In some embodiments, step 1041 comprises:
Step 10411: determine the preceding segment and/or the following segment corresponding to the target segment.
Step 10412: determine the total word length of the target segment together with its preceding segment and/or following segment.
Step 10413: upon determining that the total word length is less than or equal to a preset word-length threshold, splice the target segment with its preceding segment and/or following segment to obtain the spliced segment.
In implementation, before splicing, the total word length of the target segment together with its preceding segment and/or following segment is computed and checked; splicing proceeds only when it is less than or equal to the preset word-length threshold.
For example, suppose the preset word-length threshold is 600 characters, the target segment is 400 characters, the preceding segment is 150 characters, and the following segment is 100 characters.
To splice the target segment with the preceding segment: 400 + 150 = 550 characters, which is below the 600-character threshold, so the word-length condition is met and the two are spliced.
To splice the target segment with the following segment: 400 + 100 = 500 characters, which is below the threshold, so the condition is met and the two are spliced.
To splice all three: 400 + 150 + 100 = 650 characters, which exceeds the 600-character threshold, so the condition is not met and the three cannot be spliced.
With this scheme, the total word length of the target segment and its preceding and/or following segment is checked, and splicing is performed only when the total is at or below the preset threshold. This guarantees that the spliced segment stays within the threshold, prevents it from becoming too long, and lets the subsequent large model extract the target keywords from it more accurately.
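Steps 10411 to 10413 and the worked example above (600-character threshold; 400-, 150-, and 100-character segments) suggest the following guard. The fallback order when the full combination exceeds the threshold is an assumption on my part; the source only states that an over-length combination cannot be spliced:

```python
def splice_within_limit(target, preceding=None, following=None, max_chars=600):
    # Try the richest context first, then fall back to smaller combinations
    # whose total content length stays within the threshold (separators are
    # not counted, matching the worked example's arithmetic).
    candidates = [
        (preceding, target, following),
        (preceding, target, None),
        (None, target, following),
        (None, target, None),
    ]
    for pre, tgt, post in candidates:
        parts = [p for p in (pre, tgt, post) if p]
        if sum(len(p) for p in parts) <= max_chars:
            return " ".join(parts)
    return target  # even the target alone exceeds the limit
```

With the example's sizes, the three-way splice (650 characters) is rejected and the preceding-segment splice (550 characters) is used instead.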
In some embodiments, step 10411 comprises:
Step 10411A: determine whether a title exists within a first preset word length before the target segment in the document to which it belongs.
Step 10411B: in response to a title existing within the first preset word length before the target segment, take the text between the title and the target segment as the preceding segment.
Step 10411C: in response to no title existing within the first preset word length before the target segment, take the text within the first preset word length before the target segment as the preceding segment.
In implementation, the first preset word length is determined from experience and actual testing. For example, it may be preset empirically, customized by the user according to their needs, or determined from test results.
When determining the preceding segment of a target segment, the text is scanned upward from the target segment within the preset first preset word length to check whether a title exists there. If a title exists within that range, the text between the title and the target segment is taken as the preceding segment; if not, the text within the first preset word length is taken as the preceding segment. A title here includes a main title or a subtitle.
In addition, when no title exists within the first preset word length before the target segment, the passage containing the target segment extends beyond that range; to allow the complete passage to be viewed, the nearest title before the target segment may instead be located, and the text between that title and the target segment taken as the preceding segment.
For example, with a first preset word length of 150 characters, if a title exists within the 150 characters before the target segment, say at the position 100 characters before it, the 100 characters between the title and the target segment are taken as the preceding segment.
If no title exists within the 150 characters before the target segment, the text within the first preset word length is taken as the preceding segment; alternatively, if a title is found at the position 200 characters before the target segment, the 200 characters between that title and the target segment are taken as the preceding segment.
And/or the number of the groups of groups,
Step 10411D: determine whether a title exists within a second preset word length after the target segment in the document to which it belongs.
Step 10411E: in response to a title existing within the second preset word length after the target segment, take the text between the target segment and the title as the following segment.
Step 10411F: in response to no title existing within the second preset word length after the target segment, take the text within the second preset word length after the target segment as the following segment.
In implementation, the second preset word length is determined from experience and actual testing. For example, it may be preset empirically, customized by the user according to their needs, or determined from test results.
When determining the following segment of a target segment, the text is scanned downward from the target segment within the preset second preset word length to check whether a title exists there. If a title exists within that range, the text between the target segment and the title is taken as the following segment; if not, the text within the second preset word length is taken as the following segment. A title here includes a main title or a subtitle.
In addition, when no title exists within the second preset word length after the target segment, the passage containing the target segment extends beyond that range; to allow the complete passage to be viewed, the nearest title after the target segment may instead be located, and the text between the target segment and that title taken as the following segment.
For example, with a second preset word length of 100 characters, if a title exists within the 100 characters after the target segment, say at the position 50 characters after it, the 50 characters between the target segment and the title are taken as the following segment.
For another example, if the second preset word length is 100 characters and no title exists within the 100 characters after the target segment, the text within those 100 characters is taken as the following segment. Alternatively, when no title exists within the 100 characters, the next title may be located, for example at the 150-character position after the target segment, and the 150 characters between the target segment and that title taken as the following segment.
With this scheme, whether a title exists within the preset word length before or after the target segment is judged. When a title exists within that word length, the text between the title and the target segment is taken as the above segment or the following segment, which makes it convenient for the user to view the target segment together with its complete related segment. When no title exists within the preset word length, the text within that word length is taken as the above segment or the following segment, which prevents the spliced segment from becoming too long and makes it convenient for the subsequent large model to extract the target keywords from the spliced segment.
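The context-window rule described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the function name, the character-offset representation of segments, and the list of heading positions are all assumptions made for the example.

```python
def following_segment(doc: str, target_end: int, window: int,
                      heading_positions: list[int]) -> str:
    """Return the 'following segment' for a target segment ending at target_end."""
    # headings that fall inside the preset window after the target segment
    in_window = [p for p in heading_positions if target_end < p <= target_end + window]
    if in_window:
        # a title exists within the window: stop at the nearest one
        return doc[target_end:min(in_window)]
    # no title in the window: take the full window of text
    return doc[target_end:target_end + window]
```

The same rule, mirrored, would give the above segment by scanning backward from the start of the target segment.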
In some embodiments, step 1042 comprises:
and step 10421, extracting keywords from the spliced segment through the neural network model to obtain a target keyword.
In specific implementation, the neural network model is a model obtained through pre-training that can extract keywords from segments. The neural network model may be a large model. The target keywords are the keywords of the answer content related to the question within the spliced segment.
Specifically, the spliced segment is input into the large model, which performs keyword extraction on it and outputs the target keywords related to the question.
And step 10422, summarizing and combining the target keywords through the neural network model to obtain the fragment result.
In specific implementation, the neural network model is a model obtained through pre-training that can summarize and combine segments. The neural network model may be a large model. The segment result is the result information that is output to the user in the form of an answer after integrating the target segment.
Specifically, the target keywords are input into the large model, which summarizes the extracted target keywords to obtain the segment result, and the segment result is then output.
Specifically, the first preset number of target segments are input into a fine-tuned large model. The model judges the most relevant target segment, extracts the answer content related to the question from it, and obtains the target keywords. The large model then summarizes the target keywords into result information in the form of an answer, which is output to the user. The large model is fine-tuned on question-answer pairs provided by business personnel, covering the knowledge of most of the document content, which improves the model's ability to summarize and output answers for the relevant documents.
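As a rough illustration of the two-step extract-then-summarize flow described above, the sketch below expresses each step as a prompt passed to a placeholder `call_llm` function. The prompt wording, function names, and the `call_llm` interface are invented for illustration; the patent does not specify them.

```python
def extract_keywords(call_llm, question: str, spliced_segment: str) -> str:
    # step 10421: keyword extraction from the spliced segment
    prompt = (f"Question: {question}\n"
              f"Segment: {spliced_segment}\n"
              "Extract the keywords of the answer content related to the question.")
    return call_llm(prompt)

def summarize(call_llm, question: str, keywords: str) -> str:
    # step 10422: summarize the extracted keywords into an answer-form result
    prompt = (f"Question: {question}\n"
              f"Keywords: {keywords}\n"
              "Summarize these keywords into a direct answer to the question.")
    return call_llm(prompt)
```

In practice `call_llm` would wrap whatever large-model API is in use; here it is only a stand-in.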
With this scheme, since the large model has broad and generalized language understanding capability, the target keywords related to the question can be accurately extracted from the spliced segment by the large model, and the target keywords are summarized into a segment result in the form of an answer. The user can obtain the answer to the question directly from the segment result without having to summarize it themselves, which improves retrieval efficiency.
In some embodiments, the pre-storing the document includes:
step 10A, a text segment of the document is obtained.
In specific implementation, a text segment is a segment in the document whose content type is text. Besides text, the document may also include segments of other content types such as tables and charts.
Step 10B, determining text information in the text segment; and/or determining a segment vector according to the text segment and the text title corresponding to the text segment; and/or determining segment positioning information of the text segment.
And step 10C, storing at least one of the document, the text information, the fragment vector and the fragment positioning information.
In implementation, after the document is uploaded to the corresponding database, the document is parsed to obtain its relevant information, and the document and its relevant information are stored together in the database.
After the document is uploaded to the database, it is preprocessed. The preprocessing specifically includes examining the document content and removing repeated content, watermark information, and the like.
Parsing the document to obtain and store its relevant information specifically includes judging the types of the document content and parsing each content type separately, for example parsing text, tables, and pictures independently.
For text content, parsing proceeds sequentially from top to bottom. The title and related information are parsed first, and the table of contents is removed rather than parsed (the table of contents interferes with actual retrieval, since its high repetition of titles easily causes errors).
Different titles and body texts are split apart during parsing, and longer paragraphs are truncated, which improves the hit rate on the text content. After all the text has been parsed, it is vectorized; the vectorization uses a unified vector of the text title and the text slice, so that the corresponding document can be quickly located during retrieval.
When the question input by the user is highly related to a title but the determined segment result is not the segment corresponding to that title, retrieval accuracy is greatly affected. To solve this problem, the gap is filled by means of ES search. ES, short for Elasticsearch, is an open source search engine based on the Lucene library. Its retrieval is based on the BM25 keyword algorithm, which compensates for the shortcomings of semantic similarity. The generated vectors, text segments, and related document information are stored together in a redis database for retrieval. The BM25 algorithm, whose full name is Best Matching 25, is a text retrieval algorithm based on probability statistics.
The database stores three key-value pairs for each document fragment. The value of the first key-value pair is the text information of the fragment (i.e., its Chinese text). The value of the second key-value pair is the fragment vector generated from the fragment and its corresponding title. The value of the third key-value pair is itself a set of key-value pairs, whose main values include: the cloud object storage (Object Storage Service, OSS) location of the document, text title information, file page number, total page count, text fragment coordinates, the PDF execution standard, author information, keyword information (the keywords extracted from the text fragment), and the like.
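The three-key-value-pair layout described above can be sketched with a plain Python dict standing in for the redis database. The field names (`text`, `vector`, `meta`) and the function itself are illustrative assumptions, not the patent's actual schema.

```python
def store_fragment(db: dict, fragment_id: str, text: str,
                   vector: list[float], meta: dict) -> None:
    """Store one document fragment under three values, mirroring the layout above."""
    db[fragment_id] = {
        "text": text,      # first value: the fragment's text information
        "vector": vector,  # second value: vector of fragment + its title
        "meta": meta,      # third value: nested key-value pairs (OSS path, page, coords, ...)
    }
```

With a real redis client the same layout could be kept as a hash per fragment; the dict is used here only to keep the sketch self-contained.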
With this scheme, the document and its relevant information are stored in advance, so that when the document is searched, the fragment vectors can be retrieved, the similarity can be computed from the question vector and the fragment vectors, and the first similar segments can be accurately determined based on that similarity. In addition, by pre-storing the document and its relevant information, once the target segment has been determined by retrieval, its corresponding positioning information can be conveniently fetched, and the segment result and positioning information can be output to the user. This makes it convenient for the user to view the original text corresponding to the target segment and to determine whether the retrieved target segment is accurate and reliable.
In some embodiments, step 101 comprises:
in step 1011, a plurality of document snippets in the document are determined.
In specific implementation, the document segments are the individual segments in a pre-stored document.
Step 1012, performing a similarity operation on the problem vector of the problem and the segment vectors of the plurality of document segments through a proximity classification algorithm, so as to obtain the similarity between the problem and each document segment, and determining a second preset number of first similar segments from the plurality of document segments according to the similarity.
In specific implementation, the neighbor classification algorithm (K-Nearest Neighbor, KNN), also called the K nearest neighbor classification algorithm, is one of the classification methods in data mining. K nearest neighbors means that each sample can be represented by its K closest neighbors. The neighbor algorithm classifies each record in the data set according to its nearest neighbors.
The question is vectorized to obtain a question vector. The segment vector corresponding to each document segment is retrieved from the pre-stored document-related information. The similarity between the question vector and each segment vector is calculated through the neighbor classification algorithm, and a second preset number of first similar segments with high similarity are determined from the plurality of document segments according to the similarity.
The second preset number is preset, and the user may also customize it according to their own needs. For example, if the second preset number is 5, the 5 first similar segments with the highest similarity are determined from the plurality of document segments according to the similarity.
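A minimal sketch of the KNN-style similarity step described above, assuming cosine similarity over plain Python lists (the patent does not specify the distance metric or vector representation, so both are assumptions here):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_similar(question_vec: list[float],
                  fragment_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k fragments most similar to the question vector."""
    scored = sorted(fragment_vecs.items(),
                    key=lambda kv: cosine(question_vec, kv[1]),
                    reverse=True)
    return [frag_id for frag_id, _ in scored[:k]]
```

A production system would typically delegate this to an approximate-nearest-neighbor index rather than a full scan.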
Step 1013, performing a coincidence degree operation on the question keywords of the question and the segment keywords of the plurality of document segments, so as to obtain the coincidence degree between the question and each document segment, and determining a third preset number of second similar segments from the plurality of document segments according to the coincidence degree.
In specific implementation, keyword extraction is performed on the question to obtain the question keywords. The segment keywords corresponding to each document segment are retrieved from the pre-stored document-related information. The coincidence degree between the question keywords and each segment's keywords is calculated, and a third preset number of second similar segments with high keyword coincidence are determined from the plurality of document segments according to the coincidence degree. Alternatively, the third preset number of second similar segments with high keyword coincidence may be determined from the plurality of document segments through ES search.
ES, short for Elasticsearch, is an open source search engine based on the Lucene library. Its retrieval is based on the BM25 keyword algorithm, which compensates for the shortcomings of semantic similarity; the generated vectors, text segments, and related document information are stored together in a redis database for retrieval. The BM25 algorithm, whose full name is Best Matching 25, is a text retrieval algorithm based on probability statistics. When the question input by the user is highly related to a title but the actual answer comes from other body text, retrieval accuracy is greatly affected; ES search effectively compensates for this case in which the segment determined from the title-related question is not the segment corresponding to that title.
The third preset number is preset, and the user may also customize it according to their own needs. For example, if the third preset number is 5, the 5 second similar segments with the highest keyword coincidence are determined from the plurality of document segments according to the coincidence degree.
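As an illustration of the keyword-coincidence step, the sketch below scores fragments by the fraction of question keywords they contain. This is a simplified stand-in for the BM25-based ES query described above; the scoring function is an assumption, not the patent's formula.

```python
def keyword_overlap(question_keywords: list[str], fragment_keywords: list[str]) -> float:
    """Fraction of the question's keywords that appear in the fragment's keywords."""
    q = set(question_keywords)
    return len(q & set(fragment_keywords)) / len(q) if q else 0.0

def top_n_overlap(question_keywords: list[str],
                  fragments: dict[str, list[str]], n: int) -> list[str]:
    """Return the ids of the n fragments with the highest keyword coincidence."""
    scored = sorted(fragments.items(),
                    key=lambda kv: keyword_overlap(question_keywords, kv[1]),
                    reverse=True)
    return [frag_id for frag_id, _ in scored[:n]]
```

A real deployment would issue this as an Elasticsearch query and let BM25 handle term weighting and length normalization.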
Step 1014, combining the second preset number of first similar segments and the third preset number of second similar segments to form a plurality of the similar segments.
In specific implementation, the plurality of similar segments is obtained by combining the second preset number of first similar segments with the third preset number of second similar segments. For example, suppose the second preset number is 5, the third preset number is 5, and the first preset number is 3. The 5 first similar segments and the 5 second similar segments are combined to obtain 10 similar segments, which are input into the large model to obtain the 3 target segments most relevant to the question.
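The combining step can be sketched as a simple order-preserving union of the two candidate lists. The deduplication shown here is an assumption for the sketch; the patent does not say how overlapping fragments between the two lists are handled.

```python
def merge_candidates(first_similar: list[str], second_similar: list[str]) -> list[str]:
    """Combine vector-similar and keyword-similar candidates, dropping duplicates."""
    seen, merged = set(), []
    for frag in first_similar + second_similar:
        if frag not in seen:
            seen.add(frag)
            merged.append(frag)
    return merged
```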
With this scheme, the first similar segments are determined according to vector similarity based on the neighbor algorithm, and the second similar segments are determined according to keyword coincidence based on ES search, so the determined similar segments comprehensively account for both the vector similarity and the keyword coincidence between segments and the question, making them more comprehensive and accurate. The target segments highly relevant to the question are then determined from the similar segments by the large model, making the determined target segments more accurate. In addition, ES search compensates for the case where, during vector-similarity matching, a title is highly similar to the question but the body text corresponding to that title is not.
Through the above-described embodiments, a document search question input by a user is received, and a plurality of similar segments corresponding to the question are determined from a pre-stored document. A first preset number of target segments are determined from the plurality of similar segments, so that the determined target segments are the segments most relevant to the question, which improves retrieval accuracy. The positioning information of each target segment in its document is determined, so that the user can conveniently fetch and view the position of the target segment in that document, which improves retrieval efficiency. The positioning information also lets the user view the context segments around the target segment, so that the user can judge, in combination with the context, whether the retrieved target segment matches the question and is the retrieval result they wanted, and thereby judge the accuracy and reliability of the retrieved target segment. The target segment is analyzed to obtain a segment result; this result is related to the question and can serve as the answer to the question, making it convenient for the user to view and obtain the key information in the target segment. Finally, the positioning information and the segment result are output.
It should be noted that the embodiments of the present disclosure may be further described in the following manner:
As shown in fig. 2, fig. 2 is a flowchart of a document search output method according to an embodiment of the present disclosure. The method comprises the following steps:
And (one) uploading the document.
The original documents are provided by the business party in a unified PDF format. The documents are uploaded and stored on the corresponding server, where they are decrypted, uploaded, and stored by calling the server's database.
And (II) analyzing the document.
The decrypted PDF document is parsed; the parsing covers all the data in the PDF document's content, and the parsed data is stored in a redis database. The parsing rule is that titles and subtitles are parsed separately, all parsed content is stored in vectorized form, and the parsing also records the coordinate information of the document fragments, so that target fragments can be located later.
And (III) retrieving the document.
The question input by the user is vectorized to obtain the question vector. Similarity is calculated against the fragment vectors of the documents in the database, and the 10 first similar fragments with the highest similarity are determined.
The question asked by the user is obtained through an interface and vectorized, using the same model that generated the text fragment vectors. The similarity between the user's question and all text fragments is calculated using the KNN algorithm, judging similarity by the distance of the result. The 10 closest similar fragments are extracted, titles are added to them, and they are input into the large model, which re-ranks the 10 similar fragments and selects the 3 target fragments most relevant to the question. The third-ranked fragment is subjected to a numerical check: if its distance is greater than or equal to 0.36, it is discarded, and the second similar fragment with the highest keyword coincidence from the ES cluster, together with the remaining 2 target fragments, forms the fragment result set.
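The third-fragment filtering rule above might be sketched as follows. The exact behavior when the third fragment is kept (whether the ES fragment is still appended) is not fully specified in the text, so this sketch simply always appends the top ES hit; names and the threshold handling are illustrative assumptions.

```python
def final_fragment_set(ranked: list[str], distances: dict[str, float],
                       es_best: str, threshold: float = 0.36) -> list[str]:
    """Apply the distance check to the third-ranked fragment, then add the ES hit."""
    top3 = ranked[:3]
    if len(top3) == 3 and distances[top3[2]] >= threshold:
        top3 = top3[:2]          # discard the weak third fragment
    if es_best not in top3:
        top3.append(es_best)     # add the keyword-search fragment
    return top3
```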
And (IV) sorting large models.
The target fragments are re-ranked by the large model, and the 3 target fragments most relevant to the question are selected.
And (V) splicing the fragments.
Context values are taken for the target segment, namely 1 segment upward and 9 downward, with a limit on the total segment word length. If a main title is encountered during value taking, the taking stops, and the spliced segment is returned after splicing. The value-taking rule is chosen based on experience and actual testing, and the total segment word length does not exceed 600 characters, so that the subsequent large model can extract the target keywords more accurately.
If a subtitle is hit, all the body text under that subtitle is taken. If the hit is a text segment within the body, the title is taken upward and text is taken downward until the next main title. The preset word length threshold is 600 characters, which matches as closely as possible the maximum input length for the downstream large model's data extraction. All small fragments are merged to form the final fragment result, and three complete fragment results are passed downstream.
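The length-capped splicing rule might be sketched as below. Falling back to the target segment alone when the 600-character cap is exceeded is an assumption, since the text does not say what happens in that case.

```python
def splice(above: str, target: str, below: str, limit: int = 600) -> str:
    """Join above + target + below only if the total stays within the word-length cap."""
    combined = above + target + below
    if len(combined) <= limit:
        return combined
    return target  # assumption: fall back to the target segment alone when too long
```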
And (six) ES searching.
Based on the open source ES framework, the user's question is searched in the form of keywords, the second similar segment with the highest keyword coincidence is taken and spliced into the first similar segments determined based on vector similarity, enriching the search content.
And (seventh) extracting a summary.
The target segments are input into the large model, and the answers (i.e., the target keywords) in the target segments are extracted through the model's extraction operation.
The answers corresponding to the three target fragments are input into the fine-tuned large model, which judges the most likely answer and extracts its content. The large model is fine-tuned on question-answer pairs provided by business personnel, covering the knowledge of most of the document content, which improves the model's ability to summarize and output answers for the enterprise-standard documents.
Through the embodiment, the position of the answer in the document can be accurately positioned, and a user can conveniently and quickly position the answer. The retrieval uses a large model for sorting and content extraction, so that the retrieved answers are more accurate.
It should be noted that the method of the embodiments of the present disclosure may be performed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene, and is completed by mutually matching a plurality of devices. In the case of such a distributed scenario, one of the devices may perform only one or more steps of the methods of embodiments of the present disclosure, the devices interacting with each other to accomplish the methods.
It should be noted that the foregoing describes some embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Based on the same inventive concept, the present disclosure also provides a document searching apparatus corresponding to the method of any of the above embodiments.
Referring to fig. 3, the document searching apparatus includes:
a similar segment determining module 301 configured to receive a document searching problem input by a user, and determine a plurality of similar segments corresponding to the problem from a pre-stored document;
A target segment determination module 302 configured to determine a first preset number of target segments from a plurality of the similar segments;
A positioning information determining module 303 configured to determine positioning information of the target segment in the affiliated document;
the analysis processing module 304 is configured to analyze and process the target fragment to obtain a fragment result;
an output module 305 configured to output the positioning information and the fragment result.
In some embodiments, further comprising: pre-training to obtain a neural network model for segment sequencing;
the target segment determination module 302 includes:
The sorting processing unit is configured to input a plurality of similar fragments into the neural network model, sort the similar fragments according to the degree of correlation with the problem by using the neural network model, and obtain a plurality of sorted similar fragments;
and the target fragment determining unit is configured to determine the first preset number of target fragments from the sequenced multiple similar fragments.
In some embodiments, further comprising: pre-training to obtain a neural network model for segment extraction summary;
the analysis processing module 304 includes:
The splicing processing unit is configured to splice the target segment and the upper segment and/or the lower segment corresponding to the target segment to obtain a spliced segment;
And the extraction and summarization unit is configured to extract and summarize the spliced fragments through the neural network model to obtain the fragment results.
In some embodiments, the splice processing unit includes:
a context segment determining subunit configured to determine a context segment corresponding to the target segment and/or a context segment corresponding to the target segment;
A total segment word length determination subunit configured to determine a total segment word length of the target segment and the above segment and/or the below segment corresponding to the target segment;
And the splicing processing subunit is configured to determine that the total segment word length is smaller than or equal to a preset word length threshold, and splice the target segment with the upper segment and/or the lower segment corresponding to the target segment to obtain the spliced segment.
In some embodiments, the context segment determination subunit is specifically configured to:
determining whether a title exists in a first preset word length in the affiliated document before the target fragment;
Responding to the existence of a title in a first preset word length in front of the target segment, and taking the text between the title and the target segment as the above segment;
responding to the condition that no title exists in a first preset word length before the target segment, and taking the text in the first preset word length before the target segment as the above segment;
and/or the number of the groups of groups,
Determining whether a title exists in a second preset word length behind the target segment in the affiliated document;
Responding to the existence of a title in a second preset word length behind the target segment, and taking the text between the title and the target segment as the following segment;
and responding to the condition that no title exists in a second preset word length behind the target segment, and taking the text in the second preset word length behind the target segment as the following segment.
In some embodiments, the extraction summary unit comprises:
the extraction processing subunit is configured to perform keyword extraction processing on the spliced segments through the neural network model to obtain target keywords;
And the summarizing and processing subunit is configured to summarize and combine the target keywords through the neural network model to obtain the fragment result.
In some embodiments, the apparatus further comprises: a document storage module;
the document storage module includes:
A document fragment acquisition unit configured to acquire a text fragment of the document;
An information determination unit configured to determine text information in the text segment; and/or determining a segment vector according to the text segment and the text title corresponding to the text segment; and/or determining segment positioning information of the text segment;
and a storage unit configured to store at least one of the document, the text information, the segment vector, and the segment positioning information.
In some embodiments, the similar segment determination module 301 includes:
A document fragment determination unit configured to determine a plurality of document fragments in the document;
The first similar segment determining unit is configured to respectively perform similar operation processing on the problem vector of the problem and the segment vectors of the plurality of document segments through a proximity classification algorithm to obtain the similarity of the problem and each document segment, and determine a second preset number of first similar segments from the plurality of document segments according to the similarity;
A second similar segment determining unit configured to perform a coincidence degree operation process on a question keyword of the question and segment keywords of the plurality of document segments, respectively, to obtain a coincidence degree of the question and each document segment, and determine a third preset number of second similar segments from the plurality of document segments according to the coincidence degree;
And a similar segment determining unit configured to combine the second preset number of first similar segments and the third preset number of second similar segments to form a plurality of the similar segments.
For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, the functions of the various modules may be implemented in the same one or more pieces of software and/or hardware when implementing the present disclosure.
The device of the foregoing embodiment is configured to implement the corresponding document searching method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Based on the same inventive concept, the present disclosure also provides an electronic device corresponding to the method of any embodiment, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, where the processor implements the method of searching documents according to any embodiment when executing the program.
Fig. 4 shows a more specific hardware architecture of an electronic device according to this embodiment. The device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 communicate with one another within the device via the bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The memory 1020 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), static storage, dynamic storage, etc. The memory 1020 may store an operating system and other application programs; when the embodiments of the present specification are implemented in software or firmware, the associated program code is stored in the memory 1020 and executed by the processor 1010.
The input/output interface 1030 is used to connect with an input/output module for inputting and outputting information. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The communication interface 1040 is used to connect a communication module (not shown) to enable communication interactions between the present device and other devices. The communication module may communicate in a wired manner (e.g., USB (Universal Serial Bus), network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth).
Bus 1050 includes a path for transferring information between components of the device (e.g., processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although the above-described device only shows processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050, in an implementation, the device may include other components necessary to achieve proper operation. Furthermore, it will be understood by those skilled in the art that the above-described apparatus may include only the components necessary to implement the embodiments of the present description, and not all the components shown in the drawings.
The electronic device of the foregoing embodiment is configured to implement the corresponding document searching method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Based on the same inventive concept, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the document searching method according to any of the above embodiments, corresponding to the method of any of the above embodiments.
The computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device.
The storage medium of the above embodiment stores computer instructions for causing the computer to perform the document searching method according to any of the above embodiments, and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
Those of ordinary skill in the art will appreciate that the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the disclosure, including the claims, is limited to these examples. Under the idea of the present disclosure, the technical features of the above embodiments, or of different embodiments, may also be combined, and the steps may be implemented in any order. Many other variations of the different aspects of the embodiments of the present disclosure exist as described above, which are not provided in detail for the sake of brevity.
Additionally, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures, in order to simplify the illustration and discussion, and so as not to obscure the embodiments of the present disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the embodiments of the present disclosure, and this also accounts for the fact that specifics with respect to the implementation of such block diagram devices are highly dependent upon the platform on which the embodiments are to be implemented (i.e., such specifics should be well within the purview of one skilled in the art). Where specific details (e.g., circuits) are set forth to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative and not restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of those embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Accordingly, any omissions, modifications, equivalents, improvements, and the like, which are within the spirit and principles of the embodiments of the disclosure, are intended to be included within the scope of the disclosure.

Claims (10)

1. A document searching method, the method comprising:
receiving a document searching problem input by a user, and determining a plurality of similar fragments corresponding to the problem from a pre-stored document;
determining a first preset number of target fragments from a plurality of similar fragments;
Determining positioning information of the target fragment in the belonged document;
analyzing and processing the target fragment to obtain a fragment result;
And outputting the positioning information and the fragment result.
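Read end to end, the five steps of claim 1 can be sketched roughly as below. This is a toy illustration under assumed data shapes (a list of segment dicts with `text` and `positioning` fields); the recall and summary logic here are trivial stand-ins for the recall, ranking, and analysis steps claimed above.

```python
def search_document(question, index, top_k=3):
    """Sketch of the claimed flow: recall similar segments, keep a preset
    number of targets, then output each target's positioning info together
    with its fragment result."""
    q_words = set(question.lower().split())
    # Step 1 (toy recall): keep segments sharing at least one word with the question.
    candidates = [s for s in index if q_words & set(s["text"].lower().split())]
    # Step 2: rank by overlap and keep the first preset number of targets.
    candidates.sort(key=lambda s: len(q_words & set(s["text"].lower().split())),
                    reverse=True)
    targets = candidates[:top_k]
    # Steps 3-5: pair positioning info with a (trivially truncated) fragment result.
    return [{"location": s["positioning"], "summary": s["text"][:50]} for s in targets]
```

The positioning info in each output entry is what lets the user jump back to the target segment's place in the source document and inspect its context.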
2. The method as recited in claim 1, further comprising: pre-training a neural network model for segment sorting;
The determining a first preset number of target segments from the plurality of similar segments includes:
Inputting a plurality of similar fragments into the neural network model, and sorting the similar fragments according to the degree of correlation with the problem by using the neural network model to obtain a plurality of sorted similar fragments;
and determining the first preset number of target fragments from the sequenced multiple similar fragments.
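The selection in claim 2 amounts to a model-scored sort followed by a cut-off. A minimal sketch, with `score_model` standing in for the pre-trained sorting network — any callable mapping a (problem, segment text) pair to a relevance score:

```python
def select_targets(question, similar_segments, score_model, first_preset_number=3):
    """Sort recalled segments by model-predicted relevance to the question,
    then keep the first preset number of them as target segments."""
    ranked = sorted(similar_segments,
                    key=lambda seg: score_model(question, seg["text"]),
                    reverse=True)
    return ranked[:first_preset_number]
```

In practice `score_model` would be the trained network's inference call; here any scoring function with that shape works.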
3. The method as recited in claim 1, further comprising: pre-training a neural network model for segment summary extraction;
The analyzing and processing the target fragment to obtain a fragment result comprises the following steps:
splicing the target segment with the above segment and/or the below segment corresponding to the target segment to obtain a spliced segment;
and extracting and summarizing the spliced fragments through the neural network model to obtain the fragment results.
4. A method according to claim 3, wherein the splicing the target segment with the above segment and/or the below segment corresponding to the target segment to obtain a spliced segment includes:
determining the above segment corresponding to the target segment and/or the below segment corresponding to the target segment;
determining the total segment word length of the above segment and/or the below segment corresponding to the target segment;
in response to determining that the total segment word length is less than or equal to a preset word length threshold, splicing the target segment with the above segment and/or the below segment corresponding to the target segment to obtain the spliced segment.
5. The method of claim 4, wherein the determining the above segment corresponding to the target segment and/or the below segment corresponding to the target segment comprises:
determining whether a title exists within a first preset word length before the target segment in the affiliated document;
in response to a title existing within the first preset word length before the target segment, taking the text between the title and the target segment as the above segment;
in response to no title existing within the first preset word length before the target segment, taking the text within the first preset word length before the target segment as the above segment;
and/or the number of the groups of groups,
determining whether a title exists within a second preset word length after the target segment in the affiliated document;
in response to a title existing within the second preset word length after the target segment, taking the text between the title and the target segment as the below segment;
and in response to no title existing within the second preset word length after the target segment, taking the text within the second preset word length after the target segment as the below segment.
6. A method according to claim 3, wherein the extracting and summarizing the spliced segments by the neural network model to obtain the segment results comprises:
Performing keyword extraction processing on the spliced segments through the neural network model to obtain target keywords;
and summarizing and combining the target keywords through the neural network model to obtain the fragment result.
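As a rough stand-in for the trained model of claim 6, keyword extraction can be imitated with token frequencies and the fragment result with a simple combination of the extracted keywords. The stopword list, token pattern, and output format below are all assumptions for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}  # toy list

def extract_keywords(spliced_segment, top_n=5):
    """Stand-in for the model's keyword extraction: the most frequent
    non-stopword tokens in the spliced segment."""
    words = [w for w in re.findall(r"[a-z]+", spliced_segment.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def fragment_result(spliced_segment, top_n=5):
    """Claim 6 shape: extract target keywords, then combine them into a
    one-line fragment result."""
    return "Key points: " + ", ".join(extract_keywords(spliced_segment, top_n))
```

A real system would replace both functions with the pre-trained network's inference calls; only the two-stage shape (extract, then combine) follows the claim.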
7. The method of claim 1, wherein the pre-storing the document comprises:
Acquiring a text fragment of the document;
Determining text information in the text segment; and/or determining a segment vector according to the text segment and the text title corresponding to the text segment; and/or determining segment positioning information of the text segment;
At least one of the document, the text information, the segment vector, and the segment locating information is stored.
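A possible shape for claim 7's pre-storage step: split the document into fixed-size text segments, build each segment vector from the segment plus its most recent title, and record positioning info so hits can later be located in the source. The chunk size, the all-caps title convention, and the `embed` callable are assumptions, not the patent's method.

```python
def index_document(doc_id, text, embed, chunk_size=200):
    """Split a document into text segments, attach a segment vector built
    from the segment and its most recent title, and record positioning
    info (character offsets) for later retrieval display."""
    entries, title = [], ""
    for start in range(0, len(text), chunk_size):
        segment = text[start:start + chunk_size]
        first_line = segment.split("\n", 1)[0]
        if first_line.isupper():  # toy title convention: an all-caps first line
            title = first_line
        entries.append({
            "doc_id": doc_id,
            "text": segment,                                  # text information
            "vector": embed(title + " " + segment),           # vector from title + segment
            "positioning": {"doc_id": doc_id,                 # segment positioning info
                            "start": start,
                            "end": start + len(segment)},
        })
    return entries
```

Folding the title into the embedded text is one way to realize "determining a segment vector according to the text segment and the text title corresponding to the text segment".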
8. The method of claim 1, wherein the determining a plurality of similar segments corresponding to the problem from a pre-stored document comprises:
determining a plurality of document snippets in the document;
performing similarity computation between the problem vector of the problem and the segment vectors of the plurality of document segments through a proximity classification algorithm to obtain the similarity between the problem and each document segment, and determining a second preset number of first similar segments from the plurality of document segments according to the similarity;
performing coincidence-degree computation between the problem keywords of the problem and the segment keywords of the plurality of document segments to obtain the degree of coincidence between the problem and each document segment, and determining a third preset number of second similar segments from the plurality of document segments according to the degree of coincidence;
Combining the second preset number of first similar segments with the third preset number of second similar segments to form a plurality of similar segments.
9. A document searching apparatus, characterized by comprising:
the similar segment determining module is configured to receive a document searching problem input by a user, and determine a plurality of similar segments corresponding to the problem from a pre-stored document;
A target segment determining module configured to determine a first preset number of target segments from a plurality of the similar segments;
A positioning information determining module configured to determine positioning information of the target segment in the affiliated document;
the analysis processing module is configured to analyze and process the target fragment to obtain a fragment result;
And the output module is configured to output the positioning information and the fragment result.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 8 when the program is executed.
CN202410095887.0A 2024-01-23 2024-01-23 Document searching method and device and electronic equipment Pending CN118093809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410095887.0A CN118093809A (en) 2024-01-23 2024-01-23 Document searching method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN118093809A (en) 2024-05-28

Family

ID=91149767



Similar Documents

Publication Publication Date Title
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
CN103294815B (en) Based on key class and there are a search engine device and method of various presentation modes
CN111797239B (en) Application program classification method and device and terminal equipment
CN111368042A (en) Intelligent question and answer method and device, computer equipment and computer storage medium
US11210334B2 (en) Method, apparatus, server and storage medium for image retrieval
CN107085583B (en) Electronic document management method and device based on content
CN109408821B (en) Corpus generation method and device, computing equipment and storage medium
CN109492081B (en) Text information searching and information interaction method, device, equipment and storage medium
US11651014B2 (en) Source code retrieval
CN107861753B (en) APP generation index, retrieval method and system and readable storage medium
US20230147941A1 (en) Method, apparatus and device used to search for content
KR20070009338A (en) Image search method and apparatus considering a similarity among the images
CN110737824B (en) Content query method and device
CN111538903B (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
US10885140B2 (en) Intelligent search engine
CN104881428A (en) Information graph extracting and retrieving method and device for information graph webpages
CN106570196B (en) Video program searching method and device
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
JP2022025013A (en) Extraction of multi-modal online resources associated with research papers
CN117539990A (en) Problem processing method and device, electronic equipment and storage medium
CN115858742A (en) Question text expansion method, device, equipment and storage medium
Xu et al. Estimating similarity of rich internet pages using visual information
CN118093809A (en) Document searching method and device and electronic equipment
CN114780712A (en) Quality evaluation-based news topic generation method and device
CN115186240A (en) Social network user alignment method, device and medium based on relevance information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination