CN117763084A - Knowledge base retrieval method based on text compression and related equipment - Google Patents


Info

Publication number
CN117763084A
Authority
CN
China
Prior art keywords
text
answered
questions
knowledge base
text information
Prior art date
Legal status
Pending
Application number
CN202311715572.3A
Other languages
Chinese (zh)
Inventor
徐馨兰
杨哲超
石丽娟
Current Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Original Assignee
China Telecom Technology Innovation Center
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Technology Innovation Center, China Telecom Corp Ltd filed Critical China Telecom Technology Innovation Center
Priority to CN202311715572.3A
Publication of CN117763084A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a knowledge base retrieval method based on text compression, and related equipment, in the technical field of natural language processing. The method comprises: obtaining a question to be answered; retrieving, from a pre-constructed knowledge base, a plurality of pieces of text information corresponding to the question to be answered; compressing the pieces of text information according to the similarity between the question to be answered and each piece, and determining the compressed text information; and inputting the question to be answered and the compressed text information into a pre-trained large language model, and outputting an answer to the question to be answered. By compressing the text information retrieved from the knowledge base, the method reduces the length of the text information input into the large language model, so that the model can answer questions on the basis of more comprehensive text information, alleviating the problem that answer accuracy suffers because the large language model's input is limited and occupied by text spliced from the knowledge base.

Description

Knowledge base retrieval method based on text compression and related equipment
Technical Field
The disclosure relates to the technical field of natural language processing, in particular to a knowledge base retrieval method based on text compression and related equipment.
Background
A large language model is an artificial intelligence model intended to understand and generate human language. Trained on vast amounts of text data, it can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and the like. Building large models for professional fields is an important means of digital transformation in the telecommunications industry; for example, building a large network model can help networks advance from intelligent operation to higher levels. A large model that has learned communications-industry knowledge can be applied in many telecom operation scenarios, such as service provisioning, operation and maintenance assurance, and network and service quality optimization. Such models enhance many capabilities of the operations system and are gradually becoming an important core of future cloud-network operations systems.
In the prior art, a large model (large language model) for a professional field is mainly built by mounting a knowledge base onto a base large model: the query text is vector-retrieved using the large model and a vectorization tool, and corpus text from the related field is returned and added to the large model's input as reference material for forming an answer. However, the input length supported by such a domain large model is limited, and much of it is consumed by the spliced text selected from the knowledge base, so the large model cannot answer questions on the basis of more comprehensive documents, which affects the accuracy of answering questions.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a text compression-based knowledge base retrieval method and related apparatus, which overcome, at least to some extent, the problem in the related art that answer accuracy suffers because the large language model's input is limited and occupied by text spliced from the knowledge base.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a text compression-based knowledge base retrieval method, including: acquiring a question to be answered; retrieving, from a pre-constructed knowledge base, a plurality of pieces of text information corresponding to the question to be answered; compressing the plurality of pieces of text information according to the similarity between the question to be answered and the pieces of text information, and determining the compressed text information; and inputting the question to be answered and the compressed text information into a pre-trained large language model, and outputting an answer to the question to be answered.
In some embodiments, constructing the pre-built knowledge base comprises: splitting each long text in the historical knowledge base text into a plurality of first short texts according to semantic logic; cleaning and standardizing the first short texts to obtain second short texts; and extracting keywords from each second short text, splicing the keywords onto the corresponding second short text, and thereby determining the pre-constructed knowledge base.
In some embodiments, the compressing the plurality of pieces of text information according to the similarity between the question to be answered and the pieces of text information, and determining the compressed text information, includes: respectively vectorizing the question to be answered and the text information, and determining a question vector and a plurality of text information vectors; respectively calculating the similarity between the question vector and each text information vector, and determining the similarity between each piece of text information and the question to be answered; respectively judging whether the similarity between each piece of text information and the question to be answered is greater than a preset threshold; if the similarity is smaller than the preset threshold, compressing the corresponding text information; and if the similarity is greater than the preset threshold, retaining the corresponding text information.
In some embodiments, the compressing the corresponding text information includes: determining a compressed word count according to the similarity and the word count of the corresponding text information; and compressing the corresponding text information subject to the limit imposed by the compressed word count.
In some embodiments, said vectorizing the question to be answered and the plurality of pieces of text information comprises: inputting the question to be answered and the text information into a pre-trained semantic vector model, and outputting the question vector and the text information vectors in the same space.
In some embodiments, training the pre-trained semantic vector model comprises: constructing an unlabeled corpus data set and a labeled question-answer data set for the knowledge base of the specified field; performing unsupervised pre-training of the semantic vector model on the unlabeled corpus to obtain a trained semantic vector model; and performing supervised fine-tuning of the trained semantic vector model on the labeled question-answer data to obtain the pre-trained semantic vector model.
In some embodiments, the method further comprises: determining ordering information according to the similarity value between each piece of text information and the question to be answered; and inputting the text information, ordered according to the ordering information, together with the question to be answered into the pre-trained large language model, and outputting the answer to the question to be answered.
According to another aspect of the present disclosure, there is also provided a knowledge base retrieval device based on text compression, including: a question obtaining module, configured to obtain a question to be answered; a knowledge base retrieval module, configured to retrieve, from a pre-constructed knowledge base, a plurality of pieces of text information corresponding to the question to be answered; a text information compression module, configured to compress the plurality of pieces of text information according to the similarity between the question to be answered and the pieces of text information, and determine the compressed text information; and an answer output module, configured to input the question to be answered and the compressed text information into a pre-trained large language model and output an answer to the question to be answered.
According to another aspect of the present disclosure, there is also provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the text compression-based knowledge base retrieval method of any of the above via execution of the executable instructions.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text compression-based knowledge base retrieval method of any one of the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements the text compression-based knowledge base retrieval method of any of the above.
The knowledge base retrieval method based on text compression provided by the embodiments of the disclosure comprises: obtaining a question to be answered; retrieving, from a pre-constructed knowledge base, a plurality of pieces of text information corresponding to the question to be answered; compressing the pieces of text information according to the similarity between the question to be answered and each piece, and determining the compressed text information; and inputting the question to be answered and the compressed text information into a pre-trained large language model, and outputting an answer to the question to be answered. By compressing the text information retrieved from the knowledge base, the method reduces the length of the text information input into the large language model, so that the model can answer questions on the basis of more comprehensive text information, alleviating the problem that answer accuracy suffers because the large language model's input is limited and occupied by text spliced from the knowledge base.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 is a schematic diagram of a system architecture of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 2 illustrates a flow chart of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 3 is a flowchart showing a specific example of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 4 is a flowchart of yet another embodiment of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 5 is a flowchart illustrating yet another embodiment of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 6 illustrates a schematic diagram of a text compression-based knowledge base retrieval device in an embodiment of the disclosure;
FIG. 7 is a diagram showing a specific example of a system configuration of a text compression-based knowledge base retrieval method in an embodiment of the disclosure;
FIG. 8 illustrates a block diagram of a computer device in an embodiment of the present disclosure;
fig. 9 shows a schematic diagram of a computer-readable storage medium in an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
For ease of understanding, before describing embodiments of the present disclosure, several terms referred to in the embodiments of the present disclosure are first explained as follows:
LLM: large Language Model, large language model;
BGE: BAAI General Embedding, a Chinese and English semantic vector model;
Pre-train: pre-training;
SFT: Supervised Fine-Tuning, supervised fine-tuning;
RLHF: Reinforcement Learning from Human Feedback, reinforcement learning based on human feedback;
COT: Chain of Thought, chain-of-thought reasoning;
ChatGPT: Chat Generative Pre-trained Transformer, an artificial-intelligence-driven natural language processing tool;
ChatGLM: Chat General Language Model, a general chat language model;
NLP: Natural Language Processing, natural language processing.
The following detailed description of embodiments of the present disclosure refers to the accompanying drawings.
FIG. 1 illustrates an exemplary application system architecture diagram to which the text compression-based knowledge base retrieval method of embodiments of the present disclosure may be applied. As shown in fig. 1, the system architecture may include a terminal device 101, a network 102, and a server 103.
The medium used by the network 102 to provide a communication link between the terminal device 101 and the server 103 may be a wired network or a wireless network.
Alternatively, the wireless or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network, including but not limited to a local area network (LAN), metropolitan area network (MAN), wide area network (WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), virtual private networks (VPN), Internet Protocol Security (IPSec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
The terminal device 101 may be a variety of electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, smart speakers, smart watches, wearable devices, augmented reality devices, virtual reality devices, and the like.
Alternatively, the clients of the applications installed in different terminal devices 101 are the same or clients of the same type of application based on different operating systems. The specific form of the application client may also be different based on the different terminal platforms, for example, the application client may be a mobile phone client, a PC client, etc.
The server 103 may be a server providing various services, such as a background management server providing support for devices operated by the user with the terminal apparatus 101. The background management server can analyze and process the received data such as the request and the like, and feed back the processing result to the terminal equipment.
Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
In one example, a server obtains a question to be answered sent by a terminal device; the server retrieves a plurality of text messages corresponding to the questions to be answered from a pre-constructed knowledge base according to the questions to be answered; the server compresses the text messages according to the similarity between the questions to be answered and the text messages, and determines the compressed text messages; the server inputs the questions to be answered and the compressed text messages into a pre-trained large language model, and outputs answers of the questions to be answered.
Those skilled in the art will appreciate that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that any number of terminal devices, networks, and servers may be provided as desired. The embodiments of the present disclosure are not limited in this regard.
Under the system architecture, the embodiment of the disclosure provides a knowledge base searching method based on text compression, which can be executed by any electronic device with computing processing capability.
In some embodiments, the text compression-based knowledge base retrieval method provided in the embodiments of the present disclosure may be performed by a terminal device of the above system architecture; in other embodiments, the text compression-based knowledge base retrieval method provided in the embodiments of the present disclosure may be performed by a server in the system architecture described above; in other embodiments, the text compression-based knowledge base searching method provided in the embodiments of the present disclosure may be implemented by the terminal device and the server in the system architecture in an interactive manner.
Fig. 2 shows a flowchart of a text compression-based knowledge base searching method in an embodiment of the disclosure, and as shown in fig. 2, the text compression-based knowledge base searching method provided in the embodiment of the disclosure includes the following steps:
s202, obtaining a question to be answered.
It should be noted that the question to be answered may be any question that needs to be answered or explained, for example, a question posed by the user: "Does long-term lack of sleep cause hair loss?".
S204, according to the questions to be answered, searching a plurality of text messages corresponding to the questions to be answered in a pre-constructed knowledge base.
It should be noted that, the knowledge base may be a knowledge cluster, for example, an intelligent database or an artificial intelligent database, and in particular, the knowledge base is a structured, easy-to-operate, easy-to-use and comprehensive and organized knowledge cluster that is constructed for solving a problem in a certain field in knowledge engineering.
S206, compressing the plurality of text messages according to the similarity between the questions to be answered and the plurality of text messages, and determining the compressed plurality of text messages.
It should be noted that the similarity may be a similarity measure, i.e., a measure for comprehensively evaluating how similar two things are. Things can be classified by quantitative methods: the closer two things are, the greater their similarity measure, and the farther apart they are, the smaller it is. The quantitative measure can be a distance, an angle, or a correlation coefficient. The compression can be compression of text passages (text information), namely condensing long, content-rich passages into concise, clearly worded short passages according to preset requirements.
S208, inputting the questions to be answered and the compressed text information into a pre-trained large language model, and outputting answers to the questions to be answered.
It should be noted that the large language model may be an artificial intelligence model for understanding and generating human language. Large language models are trained on large amounts of text data and can perform a wide range of tasks, including text summarization, translation, sentiment analysis, and the like. They are characterized by their large scale, containing upwards of hundreds of millions of parameters, which helps them learn complex patterns in language data. Large language models are typically based on deep learning architectures such as the Transformer. Because answering users' questions requires understanding the user's intent and giving satisfactory answers, in one example a ChatGPT-type large model, such as ChatGLM, may be chosen.
By compressing the text information retrieved from the knowledge base, the method and device reduce the length of the text information input into the large language model, so that the model can answer questions on the basis of more comprehensive text information, alleviating the problem that answer accuracy suffers because the large language model's input is limited and occupied by text spliced from the knowledge base.
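The overall flow of steps S202 to S208 can be sketched as follows. All helper callables here (embed, compress, llm) are hypothetical stand-ins for the components described in the disclosure, not an implementation of it:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def answer_question(question, knowledge_base, embed, compress, llm,
                    top_k=5, threshold=0.6):
    """S202-S208: retrieve, compress low-similarity pieces, ask the LLM."""
    q_vec = embed(question)                     # S202, then vectorize the question
    # S204: retrieve the top-k pieces of text information by similarity.
    scored = sorted(((cosine(q_vec, embed(text)), text)
                     for text in knowledge_base), reverse=True)[:top_k]
    # S206: keep high-similarity pieces, compress the rest.
    context = [text if sim > threshold else compress(text, sim)
               for sim, text in scored]
    # S208: splice the question and compressed context into the LLM input.
    return llm(question, context)
```

In use, embed would be a semantic vector model such as BGE, compress a text-compression routine, and llm the large language model call.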
In one specific example, constructing the pre-built knowledge base includes: splitting each long text in the historical knowledge base text into a plurality of first short texts according to semantic logic; cleaning and standardizing the first short texts to obtain second short texts; and extracting keywords from each second short text, splicing the keywords onto the corresponding second short text, and thereby determining the pre-constructed knowledge base.
For example, from the existing knowledge base text, the long texts are split into paragraphs or sentences according to semantic logic to generate paragraph-level short texts (corresponding to the first short texts); the short texts are cleaned and normalized, including removing special symbols, punctuation, stop words, and the like, for subsequent processing and analysis; keywords are then extracted from each text segment and spliced in front of it, obtaining a plurality of pieces of text information (corresponding to the second short texts), which together constitute the pre-constructed knowledge base.
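The three construction steps above can be sketched as follows. The sentence splitter, stop-word list, and frequency-based keyword extractor are illustrative assumptions; the disclosure does not specify which extractor is used:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "and", "of", "to"}  # illustrative only

def build_knowledge_base(long_texts, top_keywords=2):
    """Split, clean, and keyword-tag texts (a sketch of the three steps)."""
    kb = []
    for long_text in long_texts:
        # Step 1: split the long text into short texts at sentence boundaries.
        for short in re.split(r"(?<=[.!?;])\s+", long_text.strip()):
            # Step 2: clean and normalize (strip symbols and punctuation).
            cleaned = re.sub(r"[^\w\s]", "", short).lower().strip()
            if not cleaned:
                continue
            # Step 3: extract keywords and splice them in front of the text.
            words = [w for w in cleaned.split() if w not in STOP_WORDS]
            keywords = [w for w, _ in Counter(words).most_common(top_keywords)]
            kb.append(f"keyword: {', '.join(keywords)}. {cleaned}")
    return kb
```

The resulting entries have the "keyword: ... " prefix shown in the worked example texts below.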
For example, existing knowledge base text includes:
first text: 1. western medicine treatment: the patient can take medicines orally, the male patient can take finasteride, the female patient can take spironolactone, and the external use of minoxidil tincture and other medicines at the scalp can be considered, if the patient has endocrine abnormality, the timely taking of medicines is recommended to improve the hormone secretion abnormality; 2. and (3) treating traditional Chinese medicine: the patient can take some traditional Chinese medicines with the alopecia preventing effect, such as radix sileris, polygonum multiflorum, fructus forsythiae, poria cocos and the like according to the doctor's advice, and the patient is assisted in treating alopecia; 3. hair transplantation treatment, etc.;
Second text: the academy of sciences has found that 1000 hair loss people have the common characteristics: alcoholism, stay up, obesity, stress, etc.;
third text: ensuring sleep time has many benefits, such as preventing alopecia, good qi and blood, etc., so we must go to sleep well, etc.;
fourth text: the king is an excellent person who has many advantages such as no hair loss, good sleep, long-term exercise, good performance in sunlight, work effort, etc.;
fifth text: some alopecia is caused by congenital inheritance, is genetically determined, and is gender and lifestyle dependent, among others.
The pre-constructed knowledge base comprises, according to the existing knowledge base text:
first text: keyword: treating alopecia. Western medicine treatment: the patient can take medicines orally, the male patient can take finasteride, the female patient can take spironolactone, and the external use of minoxidil tincture and other medicines at the scalp can be considered, if the patient has endocrine abnormality, the timely taking of medicines is recommended to improve the hormone secretion abnormality; and (3) treating traditional Chinese medicine: the patient can take some traditional Chinese medicines with the alopecia preventing effect, such as radix sileris, polygonum multiflorum, fructus forsythiae, poria cocos and the like according to the doctor's advice, and the patient is assisted in treating alopecia; hair-planting treatment ";
Second text: keyword: and (5) investigation of alopecia. The academy of sciences has found that 1000 hair loss people have the common characteristics: alcoholism, stay up, obesity, stress, and the like;
third text: keyword: sleep benefits. Ensuring sleep time has many benefits, such as preventing alopecia, good qi and blood, etc., so we must go to sleep well;
fourth text: keyword: the advantage of the king. The king is an excellent person, and has many advantages such as no alopecia, good sleep, long-term exercise, good performance, sunshine and work effort;
fifth text: keyword: the cause of alopecia. Some alopecia is caused by congenital inheritance, is genetically determined, and is gender and lifestyle dependent.
In an embodiment of the present disclosure, as shown in fig. 3, the text compression-based knowledge base retrieval method may compress the plurality of pieces of text information according to their similarity to the question to be answered and determine the compressed text information. By converting the text information into vector representations and deciding whether to compress each piece through similarity calculation between vectors, the text information can be processed accurately, improving the accuracy with which the large model answers questions. The steps are as follows:
S302, respectively vectorizing the questions to be answered and a plurality of text information, and determining the vectors of the questions to be answered and the plurality of text information;
in one embodiment of the present disclosure, vectorizing the question to be answered and the plurality of pieces of text information includes: inputting the question to be answered and the text information into a pre-trained semantic vector model, and outputting the question vector and the text information vectors in the same space.
The pre-trained semantic vector model may be an Embedding model, where Embedding refers to a process of mapping high-dimensional data (e.g., text, picture, audio) into a low-dimensional space in machine learning and natural language processing. An embedded vector is typically a vector of real numbers that represents the input data as points in a continuous numerical space. The Embedding model comprises a BGE model.
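As a stand-in for a trained Embedding model such as BGE, the following toy hashing-based embedding illustrates the idea of mapping any text into a fixed low-dimensional vector space. It carries no learned semantics and is purely illustrative of the interface, not of how BGE works internally:

```python
import hashlib
import math

def toy_embed(text, dim=64):
    """Map a string to a unit-norm vector of fixed dimension.

    A trained semantic model (e.g., BGE) would be used in practice; this
    hashing trick merely shows the shared-space mapping of queries and
    passages into vectors of the same dimension.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # Hash each token into one of `dim` buckets and count it there.
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    # Normalize so that dot products behave like cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]
```

Because the same function embeds both questions and short texts, their vectors live in the same space, which is the property the asymmetric-retrieval discussion below depends on.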
S304, similarity calculation is carried out on the to-be-answered question vector and the text information vectors respectively, and the similarity of each text information and the to-be-answered question is determined;
s306, judging whether the similarity between each text message and the question to be answered is greater than a preset threshold value or not;
S3061, if the similarity is smaller than the preset threshold, compressing the corresponding text information;
and S3062, if the similarity is greater than the preset threshold, retaining the corresponding text information.
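Combining the threshold decision of S3061/S3062 with the embodiment in which the compressed word count is determined from the similarity and the original word count, a minimal sketch might look as follows. The disclosure does not give the exact budget formula; proportional scaling and simple truncation are assumptions standing in for real compression:

```python
import math

def compress_by_similarity(text, similarity, threshold=0.6):
    """Keep text above the threshold; otherwise shrink it to a word budget.

    The budget here scales with similarity (an assumed formula), and
    truncation stands in for whatever summarization is actually used.
    """
    if similarity >= threshold:
        return text                       # S3062: retain as-is
    words = text.split()
    # Assumed budget: more similar text keeps more of its words.
    budget = max(1, math.ceil(similarity * len(words)))
    return " ".join(words[:budget])       # S3061: compress to the budget
```

Under this scheme a slice with similarity 0.5 keeps roughly half its words, while a slice above the threshold is passed to the large language model unchanged.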
Symmetric semantic retrieval (synonymous-sentence matching) aims at finding similar sentences and is a natural fit for vector retrieval, which is based on calculating vector similarity, provided the model has strong content-abstraction capability. Asymmetric semantic retrieval (question-answer matching), by contrast, requires the model to map questions and answers into the same space.
For example, when a user submits a query question, the question is converted into a vector representation; that is, the BGE model maps the question to a query vector in the same space as the short-text vectors. Similarity calculation is then performed between the query vector and each sliced short-text vector (corresponding to the short-text vector) in the knowledge base. Optionally, the similarity measure is computed using the following Formula 1 (cosine similarity):

sim(C, Q) = ( Σ_{i=1}^{n} C_i · Q_i ) / ( sqrt( Σ_{i=1}^{n} C_i² ) · sqrt( Σ_{i=1}^{n} Q_i² ) )    (Formula 1)

where n denotes the vector dimension, C denotes a short-text vector, Q denotes the query vector, sim denotes the similarity, and i denotes the i-th dimension of a vector.
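A minimal sketch of the similarity measure of Formula 1, assuming it is the standard cosine similarity consistent with the terms n, C, Q, and i defined above (the function name is illustrative):

```python
import math

def cosine_similarity(c, q):
    """Formula 1: cosine similarity between a short-text vector c and a
    query vector q of the same dimension n."""
    dot = sum(ci * qi for ci, qi in zip(c, q))      # numerator: sum over dimensions i
    norm_c = math.sqrt(sum(ci * ci for ci in c))    # |c|
    norm_q = math.sqrt(sum(qi * qi for qi in q))    # |Q|
    return dot / (norm_c * norm_q)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so higher values indicate greater relevance between the slice and the question.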
Specifically, suppose the user asks: "Does long-term sleep deprivation lead to hair loss?"
Similarity calculation is performed between the user's question and the following slices (corresponding to the plurality of text information):
First slice R1: [Text one: "keyword: hair-loss treatment. 1. Western medicine: the patient can take oral medication; male patients can take finasteride and female patients can take spironolactone, and topical application of minoxidil tincture and similar medicines to the scalp can be considered; if the patient has an endocrine abnormality, timely medication is recommended to correct the abnormal hormone secretion; 2. Traditional Chinese medicine: under a doctor's guidance, the patient can take traditional Chinese medicines with hair-loss-preventing effects, such as radix sileris, polygonum multiflorum, fructus forsythiae, and poria cocos, to assist in treating hair loss; 3. hair transplantation, etc." Similarity: 0.51];
Second slice R2: [Text two: "keyword: hair-loss survey. After surveying and studying 1,000 people with hair loss, an academy of sciences found that their common characteristics involve multiple factors such as heavy drinking, staying up late, obesity, and stress." Similarity: 0.81];
Third slice R3: [Text three: "keyword: sleep benefits. Ensuring adequate sleep time has many benefits, such as preventing hair loss and good qi and blood, so we must sleep well." Similarity: 0.52];
Fourth slice R4: [Text four: "keyword: Xiao Wang's advantages. Xiao Wang is an excellent person with many advantages, such as no hair loss, good sleep, long-term exercise, good grades, a sunny disposition, and hard work." Similarity: 0.39];
Fifth slice R5: [Text five: "keyword: causes of hair loss. Some hair loss is the result of congenital inheritance and is genetically determined; it is also related to gender and lifestyle." Similarity: 0.9].
In one embodiment of the present disclosure, as shown in fig. 4, the text compression-based knowledge base retrieval method provided in the embodiments of the present disclosure may obtain the pre-trained semantic vector model through the following steps. By combining unsupervised pre-training with supervised fine-tuning, the asymmetric semantic retrieval capability in a specified domain can be enhanced, and the accuracy with which the large model answers questions can be improved:
S402, constructing an unlabeled corpus dataset and a labeled question-answer dataset for the knowledge base in the specified domain;
S404, performing unsupervised pre-training on the semantic vector model to obtain a training semantic vector model;
S406, performing supervised fine-tuning training on the training semantic vector model to obtain the pre-trained semantic vector model.
It should be noted that unsupervised pre-training uses a large amount of unlabeled data to pre-train the model, and supervised fine-tuning is then performed on the basis of the pre-trained model: the model is first pre-trained on a large unlabeled dataset and then fine-tuned on a labeled dataset related to the target task. In supervised fine-tuning, the model is trained in a supervised-learning manner; the labeled input samples and their corresponding expected outputs are used to adjust the model's parameters, the goal being to adapt the model to a particular task by fine-tuning its weights on that task. The embedded vector model maps short texts to vector representations in a high-dimensional space, preserving the semantic information between texts, and the relevance of data is measured by the distance between vectors. The fine-tuned BGE model is used as the embedded vector model to convert short texts into semantic vectors in a specific space, which are stored in a vector database.
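The disclosure does not specify the fine-tuning objective. A common choice when fine-tuning embedding models such as BGE on labeled question-answer pairs is an in-batch contrastive (InfoNCE-style) loss, which pulls a question vector toward its labeled answer vector and away from negatives. The sketch below illustrates that objective under this assumption; the function name and temperature value are illustrative:

```python
import math

def info_nce_loss(q_vec, pos_vec, neg_vecs, temperature=0.05):
    """Contrastive loss over one labeled (question, answer) pair plus
    negative answers, as used when fine-tuning an embedding model.
    Lower loss means the question is closer to its labeled answer
    than to the negatives."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    # Score the positive first, then all negatives, and apply softmax.
    scores = [cos(q_vec, pos_vec)] + [cos(q_vec, v) for v in neg_vecs]
    exp_scores = [math.exp(s / temperature) for s in scores]
    return -math.log(exp_scores[0] / sum(exp_scores))
```

Minimizing this loss over the labeled question-answer dataset is what teaches the model to map questions and answers into the same space, which is the asymmetric retrieval capability described above.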
For example, for the question "Does long-term sleep deprivation lead to hair loss?", if no keywords are added and no supervised fine-tuning is performed, the following three texts and similarities are retrieved:
text one: the causes of alopecia are: alcoholism, stay up, stress, etc. Statistical similarity to questions: 0.3100.
text II: if it can be ensured that the sleeping time is free from some diseases, such as alopecia. Statistical similarity to questions: 0.3625.
text III: the king has many advantages such as no alopecia, good sleep and long-term exercise. Similarity to problem statistics: 0.4020.
Text one and Text two are clearly more suitable as answers, yet Text three, which merely shares more overlapping words with the question, receives the highest statistical similarity. Therefore, to make question answering more intelligent and accurate, the asymmetric semantic retrieval capability in the specified domain needs to be enhanced: first, an unlabeled corpus dataset and a labeled question-answer dataset are constructed for the knowledge base in the specified domain; then unsupervised pre-training is performed on the BGE model; and finally supervised fine-tuning training is performed.
An example of the effect after fine-tuning:
For the question "Does long-term sleep deprivation lead to hair loss?", the following three texts are retrieved:
Text one: the causes of hair loss are: heavy drinking, staying up late, stress, etc. Statistical similarity to the question: 0.6502.
Text two: ensuring adequate sleep time protects against some conditions, such as hair loss. Statistical similarity to the question: 0.4325.
Text three: Xiao Wang has many advantages, such as no hair loss, good sleep, and long-term exercise. Statistical similarity to the question: 0.2120.
Thus, after fine-tuning, the information most relevant to the question can be identified through its similarity.
Meanwhile, the present disclosure can offset contextual interference by adding keywords. It should be noted that the keywords are extracted from each text by a model; after extraction, the similarity between each text's keywords and the question is calculated.
After keywords are added, the keyword-to-question similarities retrieved for the three texts are as follows:
Text one: causes of hair loss; similarity to the question: 0.5111.
Text two: sleep benefits; similarity to the question: 0.4711.
Text three: Xiao Wang's advantages; similarity to the question: 0.1122.
It can be seen that although Text one, with the highest keyword similarity to the question, is indeed the closest to it, the keyword-to-question similarity also identifies Text two as providing useful information. Extracting keywords thus prevents useful information whose full-text similarity to the question is low from being overlooked.
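The disclosure states only that keyword-to-question similarity is computed alongside full-text similarity; one simple way to combine them (an assumption, not stated in the disclosure) is to keep a slice whenever either score clears the low threshold, so that a slice like Text two (full-text similarity 0.3625 but keyword similarity 0.4711) is not discarded:

```python
def keep_slice(text_sim, keyword_sim, low_threshold=0.4):
    """Illustrative combination rule: a slice survives filtering if either
    its full-text similarity or its keyword-to-question similarity exceeds
    the low threshold (0.4, per the thresholds described in this disclosure).
    The max-based rule itself is an assumption for illustration."""
    return max(text_sim, keyword_sim) > low_threshold
```

Under this rule, Text two is retained on the strength of its keyword similarity even though its full-text similarity alone would fall just below the threshold.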
In one example, compressing the corresponding text information includes: determining a compressed word count according to the similarity and the word count of the corresponding text information; and compressing the corresponding text information with the compressed word count as the limit on the number of words.
For example, the original text is abbreviated (corresponding to the compression) with the word count limited to the similarity multiplied by the original word count (corresponding to the word count of the corresponding text information). The text content retrieved from the domain knowledge base is thus contracted based on semantic weights, and the degree of contraction is hierarchical: the higher the similarity of an original text segment, the more of its original content is retained.
For example, a text segment whose similarity does not exceed 0.88 (a preset threshold) is abbreviated, with the word count limited to the similarity multiplied by the original word count; text whose similarity exceeds 0.88 requires no abbreviation. The contraction rearrangement is performed via CoT (chain of thought): the text slices (the first through fifth slices above) are acquired, the similarity is taken as a weight value, and similarity thresholds (corresponding to the preset threshold above) are set to filter the acquired text slices. Experiments show that text with similarity below 0.4 has low relevance to the question and can be ignored, while text with high similarity is generally of high quality and can be retained without compression, giving a good text recognition effect. Therefore, a low similarity threshold = 0.4 and a high similarity threshold = 0.88 can be set here, and texts with similarity above the low threshold are taken as reference content. The contraction rearrangement relies on the model's own chain-of-thought capability, guiding the LLM to reason step by step through encouraging prompts. In this way the large model adaptively processes the texts according to their weights to extract the most relevant information, can take in more knowledge within its input-length limit, and is less likely to forget important content, thereby providing more comprehensive and useful answers to questions in the professional domain.
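The thresholding scheme above can be sketched as a word-budget function (a sketch only: the actual abbreviation of the text to that budget is performed by the LLM via prompting; only the budget computation is shown, and the function name is illustrative):

```python
def compression_budget(text, similarity, low=0.4, high=0.88):
    """Word budget for a retrieved slice under the two-threshold scheme:
      - similarity <= low (0.4):  slice is ignored entirely (None)
      - similarity >  high (0.88): slice is kept uncompressed
      - otherwise: abbreviate to about similarity * original word count
    """
    words = text.split()
    if similarity <= low:
        return None                           # low relevance: drop
    if similarity > high:
        return len(words)                     # high quality: keep whole
    return max(1, round(similarity * len(words)))  # compress proportionally
```

This matches the example below, where the fourth slice (similarity 0.39) falls under the low threshold and is deleted, while the remaining slices are shortened in proportion to their similarity.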
In one specific example, similarities are calculated for the five texts in the pre-constructed knowledge base, which are then compressed into the following four sliced texts (Text four, with similarity 0.39, is deleted):
R1: Text one: hair loss can be treated with Western medicine, traditional Chinese medicine, hair transplantation, etc.; similarity: 0.51;
R2: Text two: people with hair loss generally share characteristics such as heavy drinking, staying up late, obesity, and high stress; similarity: 0.8;
R3: Text three: ensuring sleep prevents hair loss; similarity: 0.52;
R5: Text five: some hair loss is caused by congenital inheritance, is genetically determined, and is related to gender and lifestyle; similarity: 0.9.
The compressed R1, R2, R3, and R5 are assembled into a contracted corpus. The contracted corpus serves as referenceable context and is placed into the model input (as input) together with the question posed by the user.
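Assembling the contracted corpus and the user question into the model input can be sketched as follows (the prompt wording is illustrative, not taken from the disclosure):

```python
def build_prompt(question, compressed_slices):
    """Combine the contracted corpus (ordered list of compressed slices)
    with the user question into one model input string."""
    context = "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(compressed_slices)
    )
    return (
        "Answer the question using the reference information below.\n"
        f"Reference information:\n{context}\n"
        f"Question: {question}"
    )
```

Because each slice was already compressed in proportion to its similarity, the assembled prompt fits more distinct pieces of knowledge inside the model's input-length limit than splicing the raw slices would.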
Large language models tend to forget the middle section of their input. Therefore, if a large model is to answer questions based on a more comprehensive document, attention must be paid to the ordering of the input texts.
In one embodiment of the present disclosure, as shown in fig. 5, the text compression-based knowledge base retrieval method provided in the embodiments of the present disclosure may sort the plurality of text information input into the large language model through the following steps, so that the large language model is less likely to forget important text content, enhancing the accuracy and comprehensiveness of its answers:
S502, determining ordering information according to the value of the similarity between each text information and the question to be answered;
it should be noted that the ordering information characterizes the respective positions of the plurality of text information when they are input into the pre-trained large language model;
S504, inputting the sorted plurality of text information and the question to be answered into the pre-trained large language model according to the ordering information, and outputting the answer to the question to be answered.
When the large model generates the answer, the prompt is designed to encourage it to consider the weights of the different text segments from the knowledge base and to pay more attention to knowledge content with higher similarity. In the ordering, slice information with high similarity can be placed at the two ends of the input, for example: [R5, R3, R1, R2]. Combined with the contracted and rearranged corpus, the large model can better answer questions in the professional domain. The present disclosure reorders the contracted texts according to the memory characteristics of large models, so that the large model is less likely to forget important text content.
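The example ordering [R5, R3, R1, R2] can be produced by ranking slices by similarity and placing them alternately at the front and back of the input, so the highest-weighted slices end up at the two ends. The exact placement rule is an assumption chosen to be consistent with that example:

```python
def rearrange_for_llm(slices_with_sim):
    """Order slices so the most similar ones sit at the two ends of the
    input, countering the LLM's tendency to forget the middle section.
    `slices_with_sim` is a list of (slice_id, similarity) pairs."""
    ranked = sorted(slices_with_sim, key=lambda p: p[1], reverse=True)
    front, back = [], []
    for i, item in enumerate(ranked):
        # Alternate placement: 1st, 3rd, ... to the front; 2nd, 4th, ... to the back.
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]
```

Applied to the slices above (R5: 0.9, R2: 0.81, R3: 0.52, R1: 0.51), this yields [R5, R3, R1, R2], with the two highest-similarity slices R5 and R2 at the outer positions.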
In one specific example, the user question is entered: "Does long-term sleep deprivation lead to hair loss?" The referenceable information is: 1. some hair loss is the result of congenital inheritance, is genetically determined, and is related to gender and lifestyle; 2. sleep prevents hair loss; 3. hair loss can be treated with Western medicine, traditional Chinese medicine, hair transplantation, etc.; 4. people with hair loss generally have problems such as heavy drinking, staying up late, obesity, and high stress.
The output of the large model ChatGLM would then be: long-term sleep deprivation can lead to hair loss. If the hair loss is severe, treatment with traditional Chinese and Western medicine or hair transplantation can be considered.
It should be noted that, in the technical solution of the present disclosure, the acquiring, storing, using, processing, etc. of data all conform to relevant regulations of national laws and regulations, and various types of data such as personal identity data, operation data, behavior data, etc. relevant to individuals, clients, crowds, etc. acquired in the embodiments of the present disclosure have been authorized.
Based on the same inventive concept, the embodiments of the present disclosure also provide a knowledge base retrieval device based on text compression, as described in the following embodiment. Since the principle by which the device embodiment solves the problem is similar to that of the method embodiment, the implementation of the device embodiment can refer to the implementation of the method embodiment; repeated description is omitted.
Fig. 6 shows a schematic diagram of a knowledge base retrieval device based on text compression in an embodiment of the disclosure. As shown in fig. 6, the device includes: a question-to-be-answered acquisition module 61, a knowledge base retrieval module 62, a text information compression module 63, an answer output module 64, a knowledge base construction module 65, a semantic vector model training module 66, and a text information ordering module 67.
The question to be answered acquisition module 61 is used for acquiring questions to be answered;
the knowledge base searching module 62 is configured to search a plurality of text information corresponding to the question to be answered in a pre-constructed knowledge base according to the question to be answered;
a text information compression module 63, configured to compress a plurality of text information according to the similarity between the question to be answered and the plurality of text information, and determine the compressed plurality of text information;
the answer output module 64 is configured to input the questions to be answered and the compressed text messages into a pre-trained large language model, and output answers to the questions to be answered.
In one embodiment of the present disclosure, the text compression-based knowledge base searching apparatus further includes a knowledge base construction module 65, configured to split each long text in the historical knowledge base text according to semantic logic, and generate a plurality of first short texts; cleaning and standardizing the first short texts to obtain second short texts; and extracting the second short texts corresponding to the keyword splicing of each second short text, and determining a pre-constructed knowledge base.
In one embodiment of the present disclosure, the text information compression module 63 is further configured to: respectively vectorizing the questions to be answered and a plurality of text information, and determining the vectors of the questions to be answered and the text information; respectively carrying out similarity calculation on the to-be-answered question vector and a plurality of text information vectors, and determining the similarity of each text information and the to-be-answered question; respectively judging whether the similarity between each text message and the question to be answered is greater than a preset threshold value; if the similarity is smaller than a preset threshold, compressing the corresponding text information; and if the similarity is greater than a preset threshold, reserving the corresponding text information.
In one embodiment of the present disclosure, the text information compression module 63 is further configured to: determining the number of compressed words according to the similarity and the number of corresponding text information words; and compressing the corresponding text information according to the limitation of the number of the compressed words to the number of the words.
In one embodiment of the present disclosure, the text information compression module 63 is further configured to: and inputting the questions to be answered and the text information into a pre-trained semantic vector model, and outputting the questions to be answered vector and the text information vector in the same space.
In one embodiment of the present disclosure, the text compression-based knowledge base searching device further includes a semantic vector model training module 66, configured to construct a label-free corpus dataset and a labeled question-answer dataset of the knowledge base in the specified domain; performing unsupervised pre-training on the semantic vector model to obtain a training semantic vector model; and performing supervised fine tuning training on the training semantic vector model to obtain a pre-trained semantic vector model.
In one embodiment of the present disclosure, the text compression-based knowledge base retrieval device further includes a text information ordering module 67, configured to determine ordering information according to the value of the similarity between each text information and the question to be answered; and to input the sorted text information and the question to be answered into the pre-trained large language model according to the ordering information, and output the answer to the question to be answered.
It should be noted that, the question obtaining module 61, the knowledge base retrieving module 62, the text information compressing module 63, and the answer outputting module 64 correspond to S202 to S208 in the method embodiment, and the above modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the method embodiment. It should be noted that the modules described above may be implemented as part of an apparatus in a computer system, such as a set of computer-executable instructions.
FIG. 7 is a schematic diagram of a specific example of a text compression-based knowledge base retrieval system in an embodiment of the disclosure, as shown in FIG. 7, comprising: domain knowledge base 71, embedding vector model 72, vector database 73, compression rearrangement module 74 and large language model 75.
Wherein the domain knowledge base 71 contains a large amount of text data in the specified domain.
The embedded vector model 72 (Embedding model) is used to map text into vectors of a particular semantic space. To better match the domain knowledge base, this disclosure fine-tunes it.
The vector database 73 is used for storing semantic vectors output by the Embedding vector model and giving indexes, and also has the function of searching according to the similarity.
The compression rearrangement module 74 adaptively compresses and rearranges the retrieved content, so that the large model, while retaining general knowledge, can provide comprehensive and reliable answers to questions in the professional domain. In specific use, the compression rearrangement module: acquires the weight values and the text slices; designs the compression-rearrangement prompt; has the LLM execute step-by-step reasoning; and outputs the processed text slices.
The large language model 75 has the functions of understanding, reasoning and generating human language, and can solve various NLP tasks by using the existing pre-trained general model.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects may be referred to herein as a "circuit," module "or" system.
An electronic device 800 according to such an embodiment of the present disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 8, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 connecting the various system components, including the memory unit 820 and the processing unit 810.
Wherein the storage unit stores program code that is executable by the processing unit 810 such that the processing unit 810 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the present specification.
For example, the processing unit 810 may perform the following steps of the method embodiment to obtain the question to be answered; searching a plurality of text messages corresponding to the questions to be answered in a pre-constructed knowledge base according to the questions to be answered; compressing the text messages according to the similarity between the questions to be answered and the text messages, and determining the compressed text messages; inputting the questions to be answered and the compressed text information into a pre-trained large language model, and outputting answers of the questions to be answered.
The storage unit 820 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 8201 and/or cache memory 8202, and may further include Read Only Memory (ROM) 8203.
Storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 840 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 800, and/or any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 850. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. As shown, network adapter 860 communicates with other modules of electronic device 800 over bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowcharts may be implemented as a computer program product comprising: and a computer program which, when executed by the processor, implements the text compression-based knowledge base retrieval method described above.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium, which may be a readable signal medium or a readable storage medium, is also provided. Fig. 9 illustrates a schematic diagram of a computer-readable storage medium … in an embodiment of the present disclosure, where a program product capable of implementing the method of the present disclosure is stored on the computer-readable storage medium 900 as shown in fig. 9. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, the program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A text compression-based knowledge base retrieval method, comprising:
acquiring a question to be answered;
according to the questions to be answered, searching in a pre-constructed knowledge base to obtain a plurality of text messages corresponding to the questions to be answered;
compressing the plurality of text messages according to the similarity between the questions to be answered and the plurality of text messages, and determining the compressed plurality of text messages;
inputting the questions to be answered and the compressed text messages into a pre-trained large language model, and outputting answers to the questions to be answered.
2. The text compression-based knowledge base retrieval method as claimed in claim 1, wherein said pre-built knowledge base comprises:
splitting each long text in a historical knowledge base text according to semantic logic, to generate a plurality of first short texts;
cleaning and normalizing the plurality of first short texts, to obtain a plurality of second short texts; and
extracting keywords of each second short text, splicing the keywords with the corresponding second short text, and determining the pre-constructed knowledge base.
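The three construction steps of claim 2 can be sketched as below. The sentence-boundary splitter and the "longest words" keyword extractor are simplistic stand-ins for the semantic-logic splitting and keyword extraction the claim describes.

```python
import re

def split_long_text(long_text):
    # Step 1: split into short texts at sentence boundaries (a crude proxy
    # for semantic-logic splitting).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", long_text) if s.strip()]

def clean_and_normalize(short_texts):
    # Step 2: collapse whitespace and lowercase.
    return [" ".join(t.split()).lower() for t in short_texts]

def extract_keywords(text, k=3):
    # Step 3 (stand-in): treat the k longest distinct words as "keywords".
    return sorted(set(re.findall(r"\w+", text)), key=len, reverse=True)[:k]

def build_knowledge_base(long_texts):
    kb = []
    for doc in long_texts:
        for t in clean_and_normalize(split_long_text(doc)):
            # Splice keywords together with the short text, as in the claim.
            kb.append({"keywords": extract_keywords(t), "text": t})
    return kb

kb = build_knowledge_base(["Text compression shortens context.  It helps retrieval!"])
print(kb)
```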
3. The text compression-based knowledge base retrieval method according to claim 1, wherein compressing the plurality of pieces of text information according to the similarity between the question to be answered and each piece of text information comprises:
performing vectorized representation on the question to be answered and the plurality of pieces of text information respectively, to determine a question vector and a plurality of text information vectors;
calculating the similarity between the question vector and each text information vector, to determine the similarity between each piece of text information and the question to be answered;
determining whether the similarity between each piece of text information and the question to be answered is greater than a preset threshold;
if the similarity is less than the preset threshold, compressing the corresponding text information; and
if the similarity is greater than the preset threshold, retaining the corresponding text information.
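The threshold logic of claim 3 can be sketched as follows. The vectors here are toy bag-of-words counts rather than the pretrained semantic vectors the claim calls for, and the 0.5 threshold is an assumed value.

```python
import math

def vectorize(text, vocab):
    # Toy bag-of-words vectorization over a shared vocabulary.
    return [text.lower().split().count(w) for w in vocab]

def cosine(a, b):
    # Cosine similarity between two count vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_for_compression(question, texts, threshold=0.5):
    # Shared vocabulary puts question and texts in the same vector space.
    vocab = sorted({w for t in [question] + texts for w in t.lower().split()})
    qv = vectorize(question, vocab)
    decisions = []
    for t in texts:
        sim = cosine(qv, vectorize(t, vocab))
        # Below the threshold: mark for compression; otherwise keep as-is.
        decisions.append((t, "compress" if sim < threshold else "keep"))
    return decisions

out = select_for_compression("blue sky", ["blue sky today", "green grass field"])
print(out)
```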
4. The text compression-based knowledge base retrieval method according to claim 3, wherein compressing the corresponding text information comprises:
determining a compressed word count according to the similarity and the word count of the corresponding text information; and
compressing the corresponding text information so that its length does not exceed the compressed word count.
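One way to read claim 4 is sketched below: derive a word budget from the similarity score and the original length, then truncate to that budget. The proportional mapping is an assumption for illustration; the claim does not disclose the exact formula.

```python
import math

def compressed_word_count(similarity, word_count):
    # Assumed mapping: more similar texts keep a larger share of their words,
    # with a floor of one word.
    return max(1, math.ceil(similarity * word_count))

def compress_to_budget(text, similarity):
    # Truncate the text to the derived word budget.
    words = text.split()
    return " ".join(words[:compressed_word_count(similarity, len(words))])

print(compress_to_budget("one two three four five six seven eight", 0.5))
```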
5. The text compression-based knowledge base retrieval method according to claim 3, wherein performing vectorized representation on the question to be answered and the plurality of pieces of text information comprises:
inputting the question to be answered and the plurality of pieces of text information into a pre-trained semantic vector model, and outputting the question vector and the text information vectors in the same vector space.
6. The text compression-based knowledge base retrieval method according to claim 5, wherein the pre-trained semantic vector model is obtained by:
constructing an unlabeled corpus dataset and a labeled question-answer dataset for a knowledge base in a specified domain;
performing unsupervised pre-training on a semantic vector model using the unlabeled corpus dataset, to obtain a trained semantic vector model; and
performing supervised fine-tuning on the trained semantic vector model using the labeled question-answer dataset, to obtain the pre-trained semantic vector model.
7. The text compression-based knowledge base retrieval method according to any one of claims 1 to 6, further comprising:
determining ordering information according to the similarity value between each piece of text information and the question to be answered; and
inputting the ordered pieces of text information and the question to be answered into the pre-trained large language model according to the ordering information, and outputting the answer to the question to be answered.
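The ordering step of claim 7 can be sketched as below: rank the (text, similarity) pairs by similarity before assembling the model input. The prompt layout is illustrative only.

```python
def order_by_similarity(scored_texts):
    # scored_texts: list of (text, similarity) pairs; highest similarity first.
    return [t for t, _ in sorted(scored_texts, key=lambda p: p[1], reverse=True)]

def build_prompt(question, scored_texts):
    # Place the most relevant context immediately after the question.
    ordered = order_by_similarity(scored_texts)
    return question + "\n" + "\n".join(ordered)

prompt = build_prompt("Q?", [("low", 0.2), ("high", 0.9), ("mid", 0.5)])
print(prompt)
```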
8. A text compression-based knowledge base retrieval apparatus, comprising:
a question acquisition module, configured to acquire a question to be answered;
a knowledge base retrieval module, configured to retrieve, according to the question to be answered, a plurality of pieces of text information corresponding to the question to be answered from a pre-constructed knowledge base;
a text information compression module, configured to compress the plurality of pieces of text information according to the similarity between the question to be answered and each piece of text information, to determine the compressed pieces of text information; and
an answer output module, configured to input the question to be answered and the compressed pieces of text information into a pre-trained large language model, and output the answer to the question to be answered.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the text compression-based knowledge base retrieval method of any one of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the text compression-based knowledge base retrieval method of any one of claims 1-7.
CN202311715572.3A 2023-12-13 2023-12-13 Knowledge base retrieval method based on text compression and related equipment Pending CN117763084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311715572.3A CN117763084A (en) 2023-12-13 2023-12-13 Knowledge base retrieval method based on text compression and related equipment

Publications (1)

Publication Number Publication Date
CN117763084A (en)

Family

ID=90322964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311715572.3A Pending CN117763084A (en) 2023-12-13 2023-12-13 Knowledge base retrieval method based on text compression and related equipment

Country Status (1)

Country Link
CN (1) CN117763084A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051602A (en) * 2024-04-15 2024-05-17 可之(宁波)人工智能科技有限公司 Intelligent question-answering method and system, medium and equipment oriented to information security field

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051602A (en) * 2024-04-15 2024-05-17 可之(宁波)人工智能科技有限公司 Intelligent question-answering method and system, medium and equipment oriented to information security field

Similar Documents

Publication Publication Date Title
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
EP3832519A1 (en) Method and apparatus for evaluating translation quality
CN112613314A (en) Electric power communication network knowledge graph construction method based on BERT model
CN111241237B (en) Intelligent question-answer data processing method and device based on operation and maintenance service
CN107577826A (en) Classification of diseases coding method and system based on raw diagnostic data
CN107861954B (en) Information output method and device based on artificial intelligence
CN112084789B (en) Text processing method, device, equipment and storage medium
CN117763084A (en) Knowledge base retrieval method based on text compression and related equipment
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN110266900A (en) Recognition methods, device and the customer service system that client is intended to
CN113204611A (en) Method for establishing reading understanding model, reading understanding method and corresponding device
CN117370373A (en) Data processing method, device, electronic equipment and storage medium
CN109741824A (en) A kind of medical way of inquisition based on machine learning
CN113657105A (en) Medical entity extraction method, device, equipment and medium based on vocabulary enhancement
CN112307738B (en) Method and device for processing text
CN112199954B (en) Disease entity matching method and device based on voice semantics and computer equipment
CN111949777A (en) Intelligent voice conversation method and device based on crowd classification and electronic equipment
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
CN112925889B (en) Natural language processing method, device, electronic equipment and storage medium
CN115393849A (en) Data processing method and device, electronic equipment and storage medium
CN114328894A (en) Document processing method, document processing device, electronic equipment and medium
CN111723188A (en) Sentence display method and electronic equipment based on artificial intelligence for question-answering system
CN112395314A (en) Method, electronic device and computer readable medium for searching information
CN113327691B (en) Query method and device based on language model, computer equipment and storage medium
CN117573812B (en) Clinical trial data processing method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination