CN113806491A - Information processing method, device, equipment and medium

Info

Publication number: CN113806491A
Application number: CN202111143741.1A
Authority: CN
Inventors: 李舒; 周永鹏
Original assignee: Comac Software Co ltd; Shanghai Aviation Industry Group Co ltd
Current assignee: Comac Software Co ltd; Shanghai Aviation Industry Group Co ltd
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2021-12-17
Anticipated expiration: 2041-09-28
Also published as: CN113806491B

Abstract

The application provides a method, a device, equipment and a medium for information processing, wherein the method comprises the following steps: acquiring a query request sent by a query terminal; the query request carries a query text; according to the keywords in the query text and the first label of each document stored in a search library, carrying out similarity ranking on the documents in the search library; for each document, determining a target fragment text with similarity meeting preset requirements with the query text according to a second label corresponding to each fragment in the document; and sequencing the target fragment text of each document and the similarity of the documents, and sending the sequence to a query terminal so that the query terminal displays the target fragment text of each document according to the similarity sequence. By adopting the method, the problem that the retrieval result is not accurate enough is solved.

Description

Information processing method, device, equipment and medium

Technical Field

The present application relates to the field of information processing, and in particular, to a method, an apparatus, a device, and a medium for information processing.

Background

As technology advances, scientific research data accumulates continuously, and research resources are largely stored in databases in digital form so that users can consult them later during production. Search engines emerged to make the data in such databases easier to look up. A search engine is a system that uses specific computer programs to collect information from the Internet according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and displays the retrieved information to them.

However, in existing methods of looking up data through a search engine, the user basically inputs a query text and the retrieval system directly returns the documents related to that query text.

Disclosure of Invention

In view of the above, an object of the present application is to provide an information processing method, apparatus, device and medium, which are used to solve the problem in the prior art that search results are not accurate enough.

In a first aspect, an embodiment of the present application provides an information processing method, where the method includes:

acquiring a query request sent by a query terminal; the query request carries a query text;

according to the keywords in the query text and the first label of each document stored in a search library, carrying out similarity ranking on the documents in the search library;

for each document, determining a target fragment text with similarity meeting preset requirements with the query text according to a second label corresponding to each fragment in the document;

and sequencing the target fragment text of each document and the similarity of the documents, and sending the sequence to a query terminal so that the query terminal displays the target fragment text of each document according to the similarity sequence.

In one possible embodiment, the first tag of the document includes a second tag corresponding to each fragmented text in the document and a third tag corresponding to a headline of the document;

the second label comprises any one or more of the following words: a subtitle keyword of the fragment text, a text keyword of the fragment text, a first associated word having a correlation with the subtitle keyword, and a second associated word having a correlation with the text keyword;

the third label includes the following words: a headline keyword of the document and a third associated word having a correlation with the headline keyword.

In one possible embodiment, the first tag of each document stored in the search repository is obtained by:

for each document in the search library, fragmenting the document according to a preset segmentation requirement to obtain at least one fragmented text;

and integrating the second label of each fragment text in the document and the third label corresponding to the headline of the document into the first label of the document aiming at each document in the search library.

In one possible embodiment, the second label of the fragmented text is determined by:

determining at least one keyword of the fragment text based on content information of the fragment text;

determining associated words having correlation with the keywords according to the similarity between the keywords and each candidate word in an associated word bank;

and determining the keywords and associated words with correlation to the keywords as second labels of the fragment texts.

In a possible embodiment, for each document in the search library, performing fragmentation processing on the document according to a preset segmentation requirement to obtain at least one fragmented text, including:

for each document in the search library, if the document comprises subtitles, fragmenting the document according to the subtitles to obtain at least one fragmented text; and if the document does not comprise subtitles, fragmenting the document according to paragraphs to obtain at least one fragmented text.

In one possible embodiment, the performing similarity ranking on the documents in the search library according to the keywords in the query text and the first tag of each document stored in the search library includes:

for each document stored in the search library, calculating the vocabulary similarity of each vocabulary in the first label of the document and the keywords in the query text;

for each document stored in the search library, calculating the document similarity between the keywords in the query text and the document according to the calculated vocabulary similarity corresponding to each vocabulary in the first label of the document and the weight of each vocabulary in the first label of the document;

and according to the document similarity of each document, carrying out similarity ranking on the documents in the search library.

In one possible embodiment, the method further comprises:

and sending the keywords in the query text to a query terminal so that the query terminal can highlight the keywords in the query text contained in the target fragment text of each document.

In a second aspect, an embodiment of the present application provides an information processing apparatus, including:

the acquisition module is used for acquiring the query request sent by the query terminal; the query request carries a query text;

the ranking module is used for ranking the similarity of the documents in the search library according to the keywords in the query text and the first label of each document stored in the search library;

the determining module is used for determining a target fragment text with similarity meeting preset requirements with the query text according to the second label corresponding to each fragment in each document;

and the sending module is used for sequencing the target fragment text of each document and the similarity of the documents and sending the sequence to the query terminal so that the query terminal displays the target fragment text of each document according to the similarity sequence.

In one possible embodiment, the first tag of each document stored in the search library, as used by the ranking module, is obtained by:

In a possible embodiment, the second label of the fragment text, as used by the determining module, is determined by:

In a possible embodiment, the ranking module, when configured to rank the documents in the search base according to the similarity between the keywords in the query text and the first tag of each document stored in the search base, is specifically configured to:

In one possible embodiment, the apparatus further comprises:

and the display unit is used for sending the keywords in the query text to a query terminal so that the query terminal can highlight the keywords in the query text contained in the target fragment text of each document.

In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the steps of the method in the first aspect are implemented.

In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps of the method according to the first aspect.

In the embodiments of the present application, a query request that is sent by a query terminal and carries a query text is acquired, so that the content the user wants to search for at the query terminal is determined from the query text in the request. The documents in the search library are ranked by similarity according to the keywords in the query text and the first label of each document stored in the search library, so that the documents can be displayed on a display terminal in order of their similarity to the query text. For each document, a target fragment text whose similarity to the query text meets a preset requirement is determined according to the second label corresponding to each fragment in the document. The target fragment texts of the documents are then ordered by similarity and sent to the query terminal, so that the query terminal displays the target fragment text of each document in order of similarity.

With this method, the target fragment text with high similarity to the query text is displayed on the display terminal, so the search result can be narrowed to a specific section or paragraph, and the target fragment texts and their corresponding documents are displayed in descending order of similarity.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a schematic flowchart of an information processing method according to an embodiment of the present application;

Fig. 2 is a flowchart illustrating a method for determining a first tag of a document according to an embodiment of the present application;

Fig. 3 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application;

Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic flowchart of an information processing method according to an embodiment of the present application, and as shown in fig. 1, the method is implemented by the following steps:

Step 101, acquiring a query request sent by a query terminal; the query request carries a query text.

Specifically, the query text may be a whole document, a sentence or a single word; its content is what the user wants to search for through the query terminal. The query request includes the query text, the time at which the query text was submitted, information about the user and the like, and may also include one or more document types selected by the user. The document types may include: information, bar, encyclopedia, library, web page. When the user does not set a document search range, the search is performed over all documents in the search library by default.

Step 102, ranking the documents in the search library by similarity according to the keywords in the query text and the first label of each document stored in the search library.
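The fields of the query request described above can be sketched as a small data structure. This is only an illustrative sketch: the names `QueryRequest`, `ALL_DOCUMENT_TYPES` and `search_scope` are hypothetical and do not appear in the application, and the document-type strings simply repeat the list given in the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Document types listed in the description; the constant name is hypothetical.
ALL_DOCUMENT_TYPES = frozenset(
    {"information", "bar", "encyclopedia", "library", "web page"}
)


@dataclass
class QueryRequest:
    """Hypothetical container for the fields a query request carries."""

    query_text: str
    user_info: dict
    submitted_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # An empty set means the user did not set a document search range.
    document_types: frozenset = frozenset()

    def search_scope(self) -> frozenset:
        # With no range selected, the search covers all document types.
        return self.document_types or ALL_DOCUMENT_TYPES
```

A request created without `document_types` therefore falls back to the whole search library, matching the default behaviour described in the text.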

Specifically, the keywords in the query text may be marked by the system during labelling or querying, or may be obtained by the system through retrieval and extraction based on big data or the search library. For example, when the query text is "how much the operating voltage of the single chip microcomputer is", if first labels for "single chip microcomputer" and "voltage" have been marked in advance in the search library, the keywords in the query text are determined as "single chip microcomputer" and "operating voltage". Alternatively, a keyword extraction formula or an extraction model is set for the query text, so that key words such as verbs and nouns in the query text are extracted. In the embodiment of the application, the documents in the search library can be classified according to their first labels or the fields to which they belong, so that the user can perform classified retrieval when searching, which reduces the amount of data the system has to process. The first label of the document comprises a second label corresponding to each fragment text in the document and a third label corresponding to the headline of the document. The second label comprises any one or more of the following words: a subtitle keyword of the fragment text, a text keyword of the fragment text, a first associated word having a correlation with the subtitle keyword, and a second associated word having a correlation with the text keyword. The third label includes the following words: a headline keyword of the document and a third associated word having a correlation with the headline keyword.

Based on the keywords determined from the query text, the similarity between the keywords and the first labels of all documents stored in the search library is determined; from this, the similarity between each document in the search library and the query text is determined, and the documents in the search library are ranked by the value of that similarity.
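One way to picture the label-based keyword determination is a lookup of pre-marked label terms inside the query text. The sketch below is a simplification under stated assumptions: `extract_keywords` and `label_vocabulary` are hypothetical names, matching is plain case-insensitive substring search, and the function returns the matched label terms themselves rather than expanding them into longer phrases.

```python
def extract_keywords(query_text: str, label_vocabulary: set[str]) -> list[str]:
    """Return the label terms that occur in the query text, longest first."""
    text = query_text.lower()
    hits = [term for term in label_vocabulary if term.lower() in text]
    # Sort by descending length (ties alphabetical) for a stable result.
    hits.sort(key=lambda term: (-len(term), term))
    kept: list[str] = []
    for term in hits:
        # Skip a term that is only part of a longer term already kept.
        if not any(term.lower() in longer.lower() for longer in kept):
            kept.append(term)
    return kept


query = "how much the operating voltage of the single chip microcomputer is"
vocabulary = {"single chip microcomputer", "voltage", "chip", "current"}
keywords = extract_keywords(query, vocabulary)
```

An extraction model, as the text also allows, would replace the substring test with part-of-speech or statistical keyword extraction.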

Step 103, determining a target fragment text with similarity meeting preset requirements with the query text according to the second label corresponding to each fragment in each document.

Specifically, each fragment in the document corresponds to the key information of a section or a paragraph in the document, and the more finely a document is divided into fragments, the more second labels are stored for it in the search library.

For each document, similarity to the query text is determined according to the second labels marked for the document, the keywords in the query text and the query text itself, and the text formed by the words, sentences or paragraphs in the document whose similarity to the query text meets the preset requirement is taken as the target fragment text.

Step 104, ordering the target fragment text of each document by similarity and sending the ordered result to the query terminal, so that the query terminal displays the target fragment text of each document in order of similarity.
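The selection of target fragment texts can be sketched as a filter over the fragments of one document. The application does not fix a similarity measure or a concrete "preset requirement", so the overlap ratio and the `threshold` parameter below are assumptions, and the function names are hypothetical.

```python
def label_overlap(keywords: set[str], second_label: set[str]) -> float:
    """Stand-in similarity: share of query keywords found in the label."""
    if not keywords:
        return 0.0
    return len(keywords & second_label) / len(keywords)


def select_target_fragments(keywords, fragments, threshold=0.5):
    """Keep fragments whose label similarity reaches the threshold.

    fragments: list of (fragment_text, second_label) pairs for one document.
    Returns (similarity, fragment_text) pairs in document order.
    """
    scored = [(label_overlap(keywords, label), text) for text, label in fragments]
    return [(score, text) for score, text in scored if score >= threshold]


fragments = [
    ("Power supply: the device runs from a 5 V rail.", {"voltage", "power"}),
    ("History of the product line.", {"history"}),
]
targets = select_target_fragments({"microcontroller", "voltage"}, fragments)
```

A semantic similarity model could be substituted for `label_overlap` without changing the filtering step.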

Specifically, after the target fragment texts whose similarity to the query text meets the preset requirement are obtained in step 103, the target fragment texts and their corresponding documents are ranked by the value of their similarity to the query text, and the ranked target fragment texts, together with the document corresponding to each target fragment text, are sent to the query terminal, so that the query terminal displays the target fragment text of each document in order of similarity. In the embodiment of the application, the display terminal may display the target fragment text in several ways: for example, it may provide a path leading to the document corresponding to the target fragment text, or it may jump directly to the region of that document where the target fragment text is located and highlight the content of the target fragment text within the document.

Specifically, for each document, a fragment text is the text formed by the content corresponding to one fragment of the document. When each fragment is a paragraph of the document, the content of each paragraph forms a fragment text; when each fragment is a section of the document, all the content of that section is regarded as one fragment text, no matter how many paragraphs of text or images the section contains. If information such as images or audio appears, it can be deleted or converted into text information by a specific means. The subtitles in the fragment text, the keywords in the subtitles, the text keywords of the fragment text, the first associated words related to the subtitle keywords and the second associated words related to the text keywords can all be used as second labels of the fragment text. The second label of a fragment text may also consist of at least two words and the correspondence between them. At least one second label may be provided for each fragment text.
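The ordering performed before the result is sent to the query terminal can be sketched as follows. The payload layout, the field names and the function `build_response` are hypothetical; the application only requires that fragments and their documents arrive ordered by similarity.

```python
def build_response(results):
    """Order documents and their target fragments by descending similarity.

    results: list of dicts with 'document_id', 'document_similarity' and
    'fragments', where 'fragments' is a list of (similarity, text) pairs.
    """
    ordered = sorted(
        results, key=lambda r: r["document_similarity"], reverse=True
    )
    payload = []
    for result in ordered:
        fragments = sorted(result["fragments"], key=lambda f: f[0], reverse=True)
        payload.append(
            {
                "document_id": result["document_id"],
                "document_similarity": result["document_similarity"],
                "target_fragments": [
                    {"similarity": score, "text": text}
                    for score, text in fragments
                ],
            }
        )
    return payload
```

The `document_id` field plays the role of the path to the source document mentioned above, so the terminal can link each fragment back to its document.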

In a possible implementation, fig. 2 is a flowchart illustrating a method for determining a first tag of a document according to an embodiment of the present application, where as shown in fig. 2, the first tag of each document stored in the search library is obtained by:

Step 201, for each document in the search library, performing fragmentation processing on the document according to a preset segmentation requirement to obtain at least one fragmented text.

Specifically, before the document is fragmented according to the preset segmentation requirement, the method further includes: obtaining each document in a database and judging its type; if the document is of a text type, it is stored in the search library. The documents in the database can be uploaded, acquired from public channels, or obtained by statistics based on big data and other means. If the document is of a non-text type, it can be converted into a text-type document through a text conversion plug-in, and the text-type document is then stored in the database.

The number of documents in the search library is finite. Each document in the search library is fragmented so that it is divided into at least one fragment text. The preset segmentation requirement can be adjusted according to the actual situation: for example, the text of each paragraph in the document may be used as a fragment text, or the text formed by the content of each section in the document may be used as a fragment text. In the embodiment of the present application, a document judged to be of the text type may also be stored in a preset area of the database, with an access interface to that preset area provided to the search library, so that the user can access the target document in the database through the interface in the search library.

Step 202, aiming at each document in the search library, integrating the second label of each fragment text in the document and the third label corresponding to the headline of the document into the first label of the document.

Specifically, after the fragment texts are obtained in step 201, at least one second label is marked for each fragment text and at least one third label is marked for the headline of the document, and the second labels and third labels of the document together serve as the first label of the document.

In one possible embodiment, the second label of the fragmented text is determined by: determining at least one keyword of the fragment text based on content information of the fragment text; determining associated words having correlation with the keywords according to the similarity between the keywords and each candidate word in an associated word bank; and determining the keywords and associated words with correlation to the keywords as second labels of the fragment texts.

Specifically, the candidate words are words whose semantics are similar or identical to those of a keyword in the fragment text, such as near-synonyms, synonyms, paraphrases and abbreviations. The content information is all the content constituting the fragment text and includes but is not limited to: subtitles, headlines, text content, and image and audio content. At least one keyword of the fragment text is determined according to its content information; the keywords include headline words, subtitle words, body words and summary words of the fragment text. For each keyword determined for the fragment text, the similarity between the keyword and each candidate word in the associated word bank is determined, and the candidate words whose similarity meets a preset association threshold are taken as the associated words of that keyword. Each keyword in the fragment text, together with the associated words correlated with it, is determined as the second label of the fragment text.
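The relationship between the three labels can be sketched as a small container: the first label of a document holds the third label of its headline plus one second label per fragment. The class and function names below are hypothetical, and representing each label as a set of words is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FirstLabel:
    """Hypothetical first label: headline label plus per-fragment labels."""

    third_label: set    # headline keywords and their associated words
    second_labels: list  # one set of words per fragment, in document order

    def vocabulary(self) -> set:
        # All words of the first label, used for document-level matching.
        words = set(self.third_label)
        for label in self.second_labels:
            words |= label
        return words


def build_first_label(headline_label, fragment_labels) -> FirstLabel:
    """Integrate the third label and the second labels into a first label."""
    return FirstLabel(
        third_label=set(headline_label),
        second_labels=[set(label) for label in fragment_labels],
    )
```

Keeping the second labels separate inside the first label lets the same structure serve both document ranking (via `vocabulary`) and per-fragment matching.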
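The construction of a second label from keywords and an associated word bank can be sketched as below. The application does not specify how similarity is computed; here the standard-library `difflib.SequenceMatcher` is used as a character-level stand-in, which catches spelling variants but not true synonyms. The function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def associated_words(keyword, candidate_words, threshold=0.8,
                     similarity=string_similarity):
    """Candidate words whose similarity to the keyword meets the threshold."""
    return [
        word for word in candidate_words
        if word != keyword and similarity(keyword, word) >= threshold
    ]


def build_second_label(keywords, candidate_words, threshold=0.8):
    """Second label = the keywords plus their associated words."""
    label = set(keywords)
    for keyword in keywords:
        label.update(associated_words(keyword, candidate_words, threshold))
    return label


second_label = build_second_label(["voltage"], ["voltages", "history"])
```

Passing a different `similarity` callable to `associated_words`, for example one backed by word embeddings, would allow synonyms and abbreviations to be matched as the text describes.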

In a possible embodiment, for each document in the search library, performing fragmentation processing on the document according to a preset segmentation requirement to obtain at least one fragmented text, including: for each document in the search library, if the document comprises subtitles, fragmenting the document according to the subtitles to obtain at least one fragmented text; and if the document does not comprise subtitles, fragmenting the document according to paragraphs to obtain at least one fragmented text.

Specifically, in the embodiment of the present application, a document is divided according to subtitles, and a text composed of each subtitle and corresponding content under the subtitle is used as a fragment text; when the document has no subtitles or only one title, the document is divided according to paragraphs, and texts formed by the contents corresponding to each paragraph are used as fragment texts. The embodiment of the application does not limit the division manner of the fragment text, and for example, the fragment text can be divided according to keywords, a description manner and description contents.
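The subtitle-first, paragraph-fallback rule can be sketched as follows. How subtitles are recognised is not specified in the application, so the default `is_subtitle` predicate (a line starting with "## ") is purely an assumption, as is treating blank lines as paragraph separators.

```python
import re


def fragment_document(body: str,
                      is_subtitle=lambda line: line.startswith("## ")):
    """Split a document by subtitles, or by paragraphs if it has none."""
    lines = body.splitlines()
    if any(is_subtitle(line) for line in lines):
        fragments, current = [], []
        for line in lines:
            # A new subtitle closes the fragment collected so far.
            if is_subtitle(line) and current:
                fragments.append("\n".join(current).strip())
                current = []
            current.append(line)
        if current:
            fragments.append("\n".join(current).strip())
        return [fragment for fragment in fragments if fragment]
    # No subtitles: each blank-line-separated paragraph is a fragment.
    return [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
```

In this sketch, any text that precedes the first subtitle becomes a fragment of its own, and each subtitle stays attached to the content beneath it.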

In one possible embodiment, the step 102 of ranking the similarity of the documents in the search base according to the keywords in the query text and the first tag of each document stored in the search base includes:

Step 1021, for each document stored in the search library, calculating the vocabulary similarity between each vocabulary in the first label of the document and the keywords in the query text.

Specifically, the similarity calculation method may be to compare the vocabulary in the first tag with the vocabulary in the query text word by word, or may determine semantic similarity between the vocabulary in the first tag and the vocabulary in the query text through a model, and calculate a similarity value between the vocabulary in the first tag and the vocabulary in the query text.

Step 1022, for each document stored in the search library, calculating the document similarity between the keywords in the query text and the document according to the calculated vocabulary similarity corresponding to each vocabulary in the first label of the document and the weight of each vocabulary in the first label of the document.

Specifically, the weight of the vocabulary may be preset or determined according to the tag group in which the vocabulary is located, for example, the weight of the vocabulary in the second tag is set to be different from the weight of the vocabulary in the third tag, or the vocabulary in the document is classified, and the weight of the headline is set to be the first numerical value, the weight of the subtitle is set to be the second numerical value, and the weight of the text keyword is set to be the third numerical value. After calculating the similarity value between each vocabulary in the first tags of all documents in the search base and the keywords in the query text in step 1021, calculating the similarity between the keywords in the query text and each document stored in the search base according to the calculated similarity value and weight. By adjusting the weight of each vocabulary in the first label, different document similarity ranks can be obtained.
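The application does not give a formula for combining vocabulary similarities and weights, so the sketch below uses one plausible choice: a weighted average of each label word's best similarity to any query keyword. The function name, the default weight of 1.0 and the exact-match similarity in the example are assumptions.

```python
def document_similarity(label_words, keywords, weights, word_similarity):
    """Weighted average of each label word's best match against the keywords.

    label_words: words in the first label of one document.
    weights: mapping from word to weight; missing words default to 1.0.
    word_similarity: callable (word, keyword) -> similarity in [0, 1].
    """
    total_weight = sum(weights.get(word, 1.0) for word in label_words)
    if not label_words or not keywords or total_weight == 0:
        return 0.0
    score = 0.0
    for word in label_words:
        best = max(word_similarity(word, keyword) for keyword in keywords)
        score += weights.get(word, 1.0) * best
    return score / total_weight


def exact(a, b):
    # Simplest word-by-word comparison: 1.0 on equality, else 0.0.
    return 1.0 if a == b else 0.0


# A headline word weighted higher than a body word.
similarity = document_similarity(
    ["voltage", "history"], ["voltage"],
    {"voltage": 3.0, "history": 1.0}, exact,
)
```

Changing the weights changes the resulting scores and therefore the document ranking, which is the tuning effect described above.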

Step 1023, ranking the documents in the search library by similarity according to the document similarity of each document.

Specifically, after the similarity between each document in the search library and the query text is calculated in step 1022, the documents are ranked. In the embodiment of the application, a minimum similarity threshold may be set: documents whose document similarity is below the threshold are not displayed, or words in the first label whose similarity is below the threshold are removed and do not take part in the document similarity calculation. This reduces the time the user spends browsing useless information or reduces the computation load of the system.
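Ranking with a minimum similarity threshold can be sketched in a few lines. The function name and the `min_similarity` parameter are hypothetical, and only the first of the two threshold uses described above (hiding low-scoring documents) is shown.

```python
def rank_documents(similarities, min_similarity=0.0):
    """Rank documents by descending similarity, dropping low scores.

    similarities: mapping from document id to document similarity.
    Returns a list of (document_id, similarity) pairs.
    """
    kept = [
        (doc_id, score)
        for doc_id, score in similarities.items()
        if score >= min_similarity
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


ranking = rank_documents({"a": 0.2, "b": 0.9, "c": 0.5}, min_similarity=0.3)
```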

In one possible embodiment, the method further comprises: sending the keywords in the query text to the query terminal, so that the query terminal can highlight the keywords in the query text contained in the target fragment text of each document.

Specifically, at the query terminal, the keywords in the query text are displayed separately or highlighted, and the target fragment texts found for those keywords are displayed. Words in the target fragment text whose similarity to the keywords of the query text exceeds a preset threshold are also highlighted.
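Keyword highlighting at the query terminal can be sketched with a regular-expression substitution. The marker strings and the function name are hypothetical, and this sketch highlights only literal, case-insensitive keyword matches; highlighting words that are merely similar to a keyword would need the similarity measure discussed earlier.

```python
import re


def highlight(fragment_text, keywords, open_mark="[[", close_mark="]]"):
    """Wrap every occurrence of a keyword in the given markers."""
    # Longest first so a longer keyword wins over one of its substrings.
    ordered = sorted({k for k in keywords if k}, key=len, reverse=True)
    if not ordered:
        return fragment_text
    pattern = re.compile(
        "|".join(re.escape(k) for k in ordered), re.IGNORECASE
    )
    return pattern.sub(
        lambda match: f"{open_mark}{match.group(0)}{close_mark}",
        fragment_text,
    )


marked = highlight("The operating voltage is 5 V.", ["voltage"])
```

A real terminal would replace the bracket markers with whatever emphasis its display supports.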

Fig. 3 is a schematic structural diagram of an information processing apparatus according to an embodiment of the present application, corresponding to the information processing method in fig. 1, and as shown in fig. 3, the information processing apparatus includes: an obtaining module 301, a sorting module 302, a determining module 303, and a sending module 304.

An obtaining module 301, configured to obtain a query request sent by a query terminal; the query request carries query text.

A ranking module 302, configured to rank, according to the keyword in the query text and the first tag of each document stored in the search library, the documents in the search library according to the similarity.

The determining module 303 is configured to determine, for each document, a target fragment text whose similarity to the query text meets a preset requirement according to the second tag corresponding to each fragment in the document.

A sending module 304, configured to rank the target fragment text of each document and the similarity of the document, and send the target fragment text of each document and the similarity to the query terminal, so that the query terminal displays the target fragment text of each document according to the similarity ranking.

In one possible embodiment, the first tag of the document includes a second tag corresponding to each fragmented text in the document and a third tag corresponding to a headline of the document.

The second label comprises any one or more of the following words: a subtitle keyword of the fragment text, a text keyword of the fragment text, a first associated word having correlation with the subtitle keyword, and a second associated word having correlation with the text keyword.

In a possible embodiment, the first tag of each document stored in the search library, as used by the ranking module, is obtained by the following steps.

And for each document in the search library, fragmenting the document according to a preset segmentation requirement to obtain at least one fragmented text.

In a possible embodiment, the second label of the fragment text, as used by the determining module, is determined by the following steps.

Determining at least one keyword of the fragment text based on content information of the fragment text.

And determining the associated words having correlation with the keywords according to the similarity between the keywords and each candidate word in the associated word library.

For each document stored in the search library, calculating the vocabulary similarity between each vocabulary in the first label of the document and the keywords in the query text.

For each document stored in the search library, calculating the document similarity between the keywords in the query text and the document according to the calculated vocabulary similarity corresponding to each vocabulary in the first label of the document and the weight of each vocabulary in the first label of the document.

In one possible embodiment, the apparatus further comprises:

Corresponding to the method in fig. 1, an embodiment of the present application further provides a computer device 400. Fig. 4 is a schematic structural diagram of the computer device provided in the embodiment of the present application. The device includes a memory 401, a processor 402, and a computer program stored in the memory 401 and executable on the processor 402, and the processor 402 implements the information processing method when executing the computer program.

Specifically, the memory 401 and the processor 402 may be a general-purpose memory and a general-purpose processor, which are not specifically limited here. When the processor 402 runs the computer program stored in the memory 401, the information processing method can be executed, thereby solving the prior-art problem that the retrieval result is not accurate enough.

Corresponding to the information processing method in fig. 1, the present application further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program performs the steps of the information processing method.

Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is executed, the information processing method can be executed, thereby solving the prior-art problem that the retrieval result is not accurate enough.

In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in an actual implementation; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some communication interfaces, and may be electrical, mechanical or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of information processing, comprising:

2. The method of claim 1, wherein the first tag of the document comprises a second tag corresponding to each fragmented text in the document and a third tag corresponding to a headline of the document;

3. The method of claim 1, wherein the first tag for each document stored in the corpus is obtained by:

4. The method of claim 3, wherein the second label of the fragmented text is determined by:

5. The method of claim 3, wherein fragmenting each document in the search corpus according to a preset segmentation requirement to obtain at least one fragmented text comprises:

6. The method of claim 1, wherein the similarity ranking of the documents in the search corpus according to the keywords in the query text and the first tag of each document stored in the search corpus comprises:

7. The method of claim 1, further comprising:

8. An information processing apparatus, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of the preceding claims 1-7 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of the preceding claims 1 to 7.