CN117891898B - Classification retrieval method and system based on GPT large model - Google Patents

Classification retrieval method and system based on GPT large model

Info

Publication number
CN117891898B
Authority
CN
China
Prior art keywords
data
model
search
gpt
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410056662.4A
Other languages
Chinese (zh)
Other versions
CN117891898A
Inventor
张群轼
姜守义
邢波波
李华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruihang Zhizhen Technology Co ltd
Original Assignee
Beijing Ruihang Zhizhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruihang Zhizhen Technology Co ltd
Priority to CN202410056662.4A
Publication of CN117891898A
Application granted
Publication of CN117891898B
Status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classification retrieval method and system based on a GPT large model, belonging to the technical field of information management. The method comprises: preprocessing original data to obtain preprocessed data, the preprocessed data comprising preprocessed training data and preprocessed data to be retrieved; labeling the preprocessed training data; establishing a data evaluation model and evaluating the labeled training data to obtain an evaluation result; optimizing the training data according to the evaluation result to obtain optimized training data; performing a first fine-tuning of a pre-constructed GPT model with the optimized training data; acquiring data to be retrieved, inputting the data to be retrieved into the first fine-tuned GPT model, and obtaining a retrieval result; and acquiring user feedback and further optimizing the first fine-tuned GPT model according to the user feedback. By labeling, evaluating and optimizing the training data and exploiting user feedback, the accuracy and efficiency of classification retrieval are improved.

Description

Classification retrieval method and system based on GPT large model
Technical Field
The application relates to the technical field of information management, in particular to a classification retrieval method and system based on a GPT large model.
Background
With the advent of the big data era, information retrieval technology has become an important means for people to obtain information. Traditional information retrieval methods are mostly based on keyword matching; although simple, their accuracy and efficiency struggle to meet users' needs when facing large amounts of data. How to improve the accuracy and efficiency of information retrieval has therefore become a hot research problem.
In recent years, the rapid development of deep learning, in particular the emergence of the generative pre-trained language model GPT, has brought new opportunities for information retrieval. By pre-training on large amounts of unlabeled data, a GPT model learns the internal structure and semantics of language and therefore performs well on natural language processing tasks. However, how to apply a GPT model to classification retrieval while improving retrieval accuracy and efficiency remains a challenging problem.
Most existing GPT-based classification retrieval methods focus only on optimizing the model and neglect the management and optimization of the training data. In fact, the quality and diversity of training data annotation have a crucial impact on model performance. Existing methods also make little use of user feedback and cannot dynamically adjust and optimize the model according to users' actual needs and feedback.
To address these problems, the invention provides a classification retrieval method based on a GPT large model. The method improves the accuracy and efficiency of classification retrieval by labeling, evaluating and optimizing the training data and by exploiting user feedback.
Disclosure of Invention
The application aims to provide a classification retrieval method and a classification retrieval system based on a GPT large model.
The application adopts the following technical scheme:
The application provides a classification retrieval method based on a GPT large model, which comprises the following steps:
preprocessing original data to obtain preprocessed data, the preprocessed data comprising preprocessed training data and preprocessed data to be retrieved;
labeling the preprocessed training data;
establishing a data evaluation model, evaluating the labeled training data to obtain an evaluation result, and optimizing the training data according to the evaluation result to obtain optimized training data;
performing a first fine-tuning of a pre-constructed GPT model with the optimized training data;
acquiring data to be retrieved, inputting the data to be retrieved into the first fine-tuned GPT model, and obtaining a retrieval result;
acquiring user feedback and further optimizing the first fine-tuned GPT model according to the user feedback, wherein the user feedback includes:
search content for which the number of repetitions of the same search by any user within a first preset time period exceeds a preset repetition threshold; and
the corresponding search content when the number of negative feedbacks from a plurality of users for the same type of search within a second preset time period exceeds a second threshold.
Further, in the classification retrieval method based on the GPT large model, labeling the preprocessed training data comprises the following steps:
randomly transmitting part of the preprocessed data to a background terminal for first labeling;
acquiring the first labeling data from the background terminal and inputting the first labeling data, as a training set, into a labeling training model;
performing second labeling on the remaining data through the labeling training model;
randomly transmitting the second labeling data to a plurality of background terminals for checking, and obtaining a plurality of check results, wherein each check result includes a plurality of different indexes;
averaging each index over the plurality of check results and comparing the average with the preset threshold of the corresponding index to obtain first comparison results; and
adjusting and optimizing the labeling training model according to the first comparison results.
Further, in the classification retrieval method based on the GPT large model, establishing the data evaluation model and evaluating the labeled training data to obtain the evaluation result comprises:
establishing a data evaluation model and evaluating the labeled data through the evaluation model;
the data evaluation model is as follows:
where P is the evaluation model score; N_i is the amount of training data for each label; m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels, and the minimum of the training data amounts of the different labels is also used; M is the preset total number of labels; the accuracy of label Z is used; α is a coefficient in the range (0, 1); and w1, w2 and w3 are weights;
F is the frequency with which any two different labels occur simultaneously in the same document or data point;
F_max is the maximum of the frequencies with which any two different labels occur simultaneously in the same document or data point;
F_min is the minimum of the frequencies with which any two different labels occur simultaneously in the same document or data point; abs(·) denotes the absolute value;
if P is below the preset first threshold, additional training data is added or the training data is re-collected.
Further, in the classification retrieval method based on the GPT large model, acquiring the data to be retrieved and inputting the data to be retrieved into the first fine-tuned GPT model to obtain the retrieval result comprises:
establishing a semantic vector space by calculating the similarity between texts on the basis of the first fine-tuned GPT model;
acquiring a text to be retrieved, inputting the text to be retrieved into the first fine-tuned GPT model, and obtaining the semantic representation vector corresponding to the text to be retrieved as a vector to be retrieved; matching the vector to be retrieved against the text vectors in the semantic vector space by similarity to obtain a retrieval result; and
sorting and displaying the retrieval results according to the classification labels and a predetermined rule.
Further, in the classification retrieval method based on the GPT large model, acquiring the user feedback comprises the following steps:
setting a first preset time period; acquiring the search content for which the number of repetitions of the same search by any user within the first preset time period exceeds the preset repetition threshold; and taking the corresponding search content as first feedback;
the first preset time period is determined as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition threshold, with K_y ≥ 3; and β is the current network congestion coefficient;
setting a second preset time period; obtaining the negative feedback of each user on same-type searches within the second preset time period, wherein same-type searches are sets of searches whose search-result similarity is greater than or equal to a preset similarity value;
counting, within the second preset time period, the number of negative feedbacks for each type of search among the plurality of same-type searches;
if the number of negative feedbacks for a certain type of search within the second preset time period exceeds a second threshold, taking the search content of the corresponding type as second feedback for fine-tuning the GPT model;
wherein the second threshold is Y:
C_Z is the total number of searches of that type within the second preset time period; F_y is the preset ratio of the number of negative feedbacks for a single category to the total number of searches for the corresponding single category; and C_y is the preset total number of searches for a single category within the second preset time period.
The application provides a classification retrieval system based on a GPT large model, which comprises:
a preprocessing module, configured to preprocess original data to obtain preprocessed data, the preprocessed data comprising preprocessed training data and preprocessed data to be retrieved;
a labeling module, configured to label the preprocessed training data;
an evaluation module, configured to establish a data evaluation model, evaluate the labeled training data to obtain an evaluation result, and optimize the training data according to the evaluation result to obtain optimized training data;
a first fine-tuning module, configured to perform a first fine-tuning of a pre-constructed GPT model with the optimized training data;
a classification retrieval module, configured to acquire data to be retrieved and input the data to be retrieved into the first fine-tuned GPT model to obtain a retrieval result; and
a feedback optimization module, configured to acquire user feedback and further optimize the first fine-tuned GPT model according to the user feedback, wherein the user feedback includes:
search content for which the number of repetitions of the same search by any user within a first preset time period exceeds a preset repetition threshold; and
the corresponding search content when the number of negative feedbacks from a plurality of users for the same type of search within a second preset time period exceeds a second threshold.
Further, in the classification retrieval system based on the GPT large model, the labeling module is configured to:
randomly transmit part of the preprocessed data to a background terminal for first labeling;
acquire the first labeling data from the background terminal and input the first labeling data, as a training set, into a labeling training model;
perform second labeling on the remaining data through the labeling training model;
randomly transmit the second labeling data to a plurality of background terminals for checking, and obtain a plurality of check results, wherein each check result includes a plurality of different indexes;
average each index over the plurality of check results and compare the average with the preset threshold of the corresponding index to obtain first comparison results; and
adjust and optimize the labeling training model according to the first comparison results.
Further, in the classification retrieval system based on the GPT large model, the evaluation module is configured to:
establish a data evaluation model and evaluate the labeled data through the evaluation model;
the data evaluation model is as follows:
where P is the evaluation model score; N_i is the amount of training data for each label; m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels, and the minimum of the training data amounts of the different labels is also used; M is the preset total number of labels; the accuracy of label Z is used; α is a coefficient in the range (0, 1); and w1, w2 and w3 are weights;
F is the frequency with which any two different labels occur simultaneously in the same document or data point;
F_max is the maximum of the frequencies with which any two different labels occur simultaneously in the same document or data point;
F_min is the minimum of the frequencies with which any two different labels occur simultaneously in the same document or data point; abs(·) denotes the absolute value;
if P is below the preset first threshold, additional training data is added or the training data is re-collected.
Further, in the classification retrieval system based on the GPT large model, the classification retrieval module is configured to:
establish a semantic vector space by calculating the similarity between texts on the basis of the first fine-tuned GPT model;
acquire a text to be retrieved, input the text to be retrieved into the first fine-tuned GPT model, and obtain the semantic representation vector corresponding to the text to be retrieved as a vector to be retrieved; match the vector to be retrieved against the text vectors in the semantic vector space by similarity to obtain a retrieval result; and
sort and display the retrieval results according to the classification labels and a predetermined rule.
Further, in the classification retrieval system based on the GPT large model, acquiring the user feedback comprises the following steps:
setting a first preset time period; acquiring the search content for which the number of repetitions of the same search by any user within the first preset time period exceeds the preset repetition threshold; and taking the corresponding search content as first feedback;
the first preset time period is determined as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition threshold, with K_y ≥ 3; and β is the current network congestion coefficient;
setting a second preset time period; obtaining the negative feedback of each user on same-type searches within the second preset time period, wherein same-type searches are sets of searches whose search-result similarity is greater than or equal to a preset similarity value;
counting, within the second preset time period, the number of negative feedbacks for each type of search among the plurality of same-type searches;
if the number of negative feedbacks for a certain type of search within the second preset time period exceeds a second threshold, taking the search content of the corresponding type as second feedback for fine-tuning the GPT model;
wherein the second threshold is Y:
C_Z is the total number of searches of that type within the second preset time period; F_y is the preset ratio of the number of negative feedbacks for a single category to the total number of searches for the corresponding single category; and C_y is the preset total number of searches for a single category within the second preset time period.
The beneficial effects of the invention include: preprocessing steps such as noise removal, word segmentation and stop-word removal greatly improve the quality and readability of the data, which is critical to the accuracy of subsequent data labeling, model training and retrieval results. Labeling the preprocessed training data and establishing a data evaluation model to evaluate the labeled data ensure the reliability and quality of the training data; optimizing the data according to the evaluation result further improves the accuracy of the training data and provides a solid foundation for model training. Fine-tuning the pre-constructed GPT model with the optimized training data improves the model's classification retrieval performance, and the fine-tuning process can be adapted to the data distribution and retrieval requirements of a specific field. Obtaining user feedback and further optimizing the model is an important element of the method: by analyzing the number of repetitions of the same search and the number of negative feedbacks for the same type of search, the needs and behavior patterns of users can be understood in depth, so that the model is optimized in a targeted way and the accuracy of classification retrieval and user satisfaction are improved. The method can obtain user feedback in a timely manner and dynamically adjust model parameters so that the classification retrieval results meet users' real-time needs; this dynamic adjustment mechanism also helps cope with changes in data distribution and user behavior and keeps the model continuously optimized and up to date. By analyzing user behavior data and collected feedback, the method can provide more personalized retrieval services; for example, based on a user's search history and preferences, the system can recommend content from related fields or topics to meet the user's personalized needs, which helps improve user satisfaction and loyalty.
Drawings
Fig. 1 is a flow chart of a classification retrieval method based on a GPT large model according to an embodiment of the present application.
Detailed Description
The present application will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that, provided there is no conflict, the following embodiments or technical features may be combined arbitrarily to form new embodiments.
Referring to fig. 1, an embodiment of the present application provides a classification retrieval method based on a GPT large model, the method comprising:
preprocessing original data to obtain preprocessed data, wherein the preprocessing comprises noise removal, word segmentation and stop-word removal; the original data comprise training data and data to be retrieved; and the original data are converted into a format acceptable to the GPT large model;
labeling the preprocessed training data;
establishing a data evaluation model, evaluating the labeled training data to obtain an evaluation result, and optimizing the training data according to the evaluation result to obtain optimized training data;
performing a first fine-tuning of a pre-constructed GPT model with the optimized training data;
acquiring data to be retrieved, inputting the data to be retrieved into the first fine-tuned GPT model, and obtaining a retrieval result;
acquiring user feedback and further optimizing the first fine-tuned GPT model according to the user feedback, wherein the user feedback includes:
search content for which the number of repetitions of the same search by any user within a first preset time period exceeds a preset repetition threshold; and
the corresponding search content when the number of negative feedbacks from a plurality of users for the same type of search within a second preset time period exceeds a second threshold.
The working principle of the technical scheme is as follows: this step includes the collection, cleaning and sorting of large amounts of raw data in order to remove extraneous or erroneous information, ensuring the quality and usability of the data. The preprocessed data will be used for subsequent training and retrieval processes.
After preprocessing, the data need to be labeled. Labeling is the process of adding semantic information to the data so that the GPT model can understand and learn from it; the labeled data are then used to train the GPT model.
By establishing a data evaluation model, the quality of the labeled data can be evaluated according to factors such as the diversity, accuracy and representativeness of the data. According to the evaluation result, the data can be further processed and optimized to improve the overall quality of the training data.
The pre-constructed GPT model is then fine-tuned with the optimized training data so that it better adapts to the specific task and data distribution. The fine-tuning process adjusts and optimizes the model parameters to improve its performance on the classification retrieval task.
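A minimal supervised fine-tuning sketch of this step is given below using the Hugging Face transformers library. The checkpoint name ("gpt2"), the text-to-label prompt format and the hyperparameters are assumptions chosen for illustration; the patent does not specify them.

```python
# Illustrative fine-tuning sketch; checkpoint, prompt format and hyperparameters
# are assumptions, not the patent's specification.
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

class LabeledRetrievalDataset(Dataset):
    """Wraps (text, category-label) pairs as plain training strings."""
    def __init__(self, pairs, tokenizer, max_length=256):
        self.examples = [
            tokenizer(f"{text} -> {label}", truncation=True, max_length=max_length)
            for text, label in pairs
        ]
    def __len__(self):
        return len(self.examples)
    def __getitem__(self, idx):
        return self.examples[idx]

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder checkpoint
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("gpt2")

train_pairs = [("how to file a patent application", "intellectual property"),
               ("engine fault diagnosis method", "mechanical engineering")]  # toy data
train_set = LabeledRetrievalDataset(train_pairs, tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt_finetuned",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```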
When new data to be retrieved arrive, they are input into the first fine-tuned GPT model, which reasons over the learned semantic information and structure to generate corresponding retrieval results. These results may be highly relevant to the data to be retrieved or merely related to them.
To further improve retrieval accuracy and efficiency, the system collects user feedback on the retrieval results. The feedback may include the number of clicks on a result, dwell time and repetition counts, and the number of negative feedbacks on same-type searches is also collected. Based on this feedback, the system further optimizes the first fine-tuned GPT model to better meet users' needs.
The technical scheme has the effects that: the quality and the readability of the data can be greatly improved through preprocessing steps such as removing noise, word segmentation, removing stop words and the like. This is critical to the accuracy of the subsequent data labeling, model training, and retrieval results. Marking the preprocessed training data, and establishing a data evaluation model to evaluate the marked data, so that the reliability and quality of the training data can be ensured. The data optimization is carried out according to the evaluation result, so that the accuracy of training data can be further improved, and powerful guarantee is provided for model training. By fine tuning the pre-constructed GPT model by using the optimized training data, the classification retrieval performance of the model can be improved. The fine tuning process can be adjusted according to specific requirements of a specific field, so that the model is better adapted to data distribution and retrieval requirements of the specific field. Obtaining user feedback and further optimizing the model is an important element of the method. By analyzing the repetition times of the same search and the negative feedback times of the same search, the requirements and the behavior patterns of the user can be deeply known, so that the model is pertinently optimized, and the accuracy of classified search and the satisfaction degree of the user are improved. The method can timely acquire user feedback and dynamically adjust model parameters, so that the classification retrieval result meets the real-time requirements of the user. At the same time, the dynamic adjustment mechanism is also helpful to deal with the change of data distribution and the change of user behavior mode, and keeps the continuous optimization and updating of the model. By analyzing the user behavior data and acquiring feedback, the method can provide more personalized search service for the user. For example, based on the user's search history and preferences, the system may recommend content of related fields or topics to meet the user's personalized needs. Such personalized services help to improve user satisfaction and loyalty.
In the classification retrieval method based on the GPT large model of this embodiment, labeling the preprocessed training data comprises:
randomly transmitting part of the preprocessed data to a background terminal for first labeling;
acquiring the first labeling data from the background terminal and inputting the first labeling data, as a training set, into a labeling training model;
performing second labeling on the remaining data through the labeling training model;
randomly transmitting the second labeling data to a plurality of background terminals for checking, and obtaining a plurality of check results, wherein each check result includes a plurality of different indexes;
averaging each index over the plurality of check results and comparing the average with the preset threshold of the corresponding index to obtain first comparison results; and
adjusting and optimizing the labeling training model according to the first comparison results.
The working principle of the technical scheme is as follows: and randomly transmitting the preprocessed data to a background terminal for preliminary labeling, wherein the preliminary labeling is the first step of the labeling process. This step aims at adding preliminary semantic information to the data, providing a basis for subsequent model training. Preliminary annotation data, referred to as first annotation data, is obtained from the background terminal. These data are then used as a training set and input into the labeling training model. The labeling training model can be trained by adopting machine learning, deep learning and other technologies; and further labeling the rest data by using a labeling training model. This step is to ensure that all data is accurately and comprehensively labeled, enriching the training dataset of the model. And randomly transmitting the second marked data to a plurality of background terminals for inspection so as to evaluate the accuracy and consistency of marking. Each background terminal returns a test result including a plurality of different indexes such as marking error rate, consistency, etc. And averaging the same index of the plurality of background terminals, and comparing the same index with a preset corresponding index threshold. By this comparison, a first comparison result can be obtained. This step is to ensure the reliability and accuracy of the annotation data. And adjusting and optimizing the labeling training model according to the first comparison results. Such adjustments may include parameters of the model, structure, or training strategies, among others. The optimization aims at improving the performance and accuracy of the model in the subsequent classified retrieval task. Through the steps, the process of data annotation and the performance of the model can be continuously optimized. The iterative process is helpful to improve the accuracy and efficiency of the classified retrieval method, so that the requirements of users are better met.
The technical scheme has the effects that: through multi-round labeling, evaluation and optimization, the labeling quality of training data can be ensured. The method can more accurately identify and classify the data, and reduces labeling errors and omission, thereby improving the accuracy and reliability of subsequent model training. The marking efficiency can be remarkably improved by randomly distributing the data to the background terminal for marking and automatically marking the residual data by using a marking training model. The method can rapidly process a large amount of data, shortens the labeling time and the cost, and accelerates the development and deployment of the whole classified retrieval system. By collecting the feedback of the user on the retrieval result, the model can be adjusted and optimized in real time, and the retrieval accuracy and the user satisfaction are improved. The method can dynamically adjust according to the user demands, enhance the self-adaptive capacity of the model, and provide more personalized and accurate information retrieval service. The steps of preprocessing, labeling, evaluation, optimization and the like are comprehensively utilized, so that the classification retrieval method based on the GPT large model is more accurate, efficient and reliable on the whole. The method can meet the requirements of users on diversification and instantaneity of information retrieval, and provides powerful support for various application scenes. In summary, the classification retrieval method based on the GPT large model can remarkably improve the quality, efficiency, accuracy and overall performance of data annotation by combining dynamic adjustment of user feedback through multiple rounds of annotation, evaluation and optimization. The method has wide application prospect and value, and can bring important innovation and breakthrough to the field of information retrieval.
In the classification retrieval method based on the GPT large model of this embodiment, establishing the data evaluation model and evaluating the labeled training data to obtain the evaluation result comprises:
establishing a data evaluation model and evaluating the labeled data through the evaluation model;
the data evaluation model is as follows:
where P is the evaluation model score; N_i is the amount of training data for each label; m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels, and the minimum of the training data amounts of the different labels is also used; M is the preset total number of labels; the accuracy of label Z is used; α is a coefficient in the range (0, 1); and w1, w2 and w3 are weights;
F is the frequency with which any two different labels occur simultaneously in the same document or data point;
F_max is the maximum of the frequencies with which any two different labels occur simultaneously in the same document or data point;
F_min is the minimum of the frequencies with which any two different labels occur simultaneously in the same document or data point; abs(·) denotes the absolute value;
if P is below the preset first threshold, additional training data is added or the training data is re-collected.
The working principle of the technical scheme is as follows: firstly, a data evaluation model is required to be established for objectively and comprehensively evaluating the marked training data. This evaluation model is intended to measure the diversity and representativeness of the training data and whether it can support subsequent GPT model training.
And calculating a plurality of evaluation indexes such as label accuracy, label quantity difference, median value and minimum value of label distribution and the like through a model formula. These indicators provide detailed information about the quality of the data, helping to assess the overall quality of the annotated data.
The preset first threshold is a key parameter used for judging the qualification standard of the marked data. If the evaluation model score P is below this threshold, it is indicated that the quality of the current training data is not satisfactory and that more training data needs to be added or acquired again.
If the evaluation model score P is lower than a preset first threshold, data adjustment and optimization are needed. This may include re-labeling portions of the data, adding new training data, adjusting tag distribution, etc. to ensure that the diversity and quality of the training data meets the needs of subsequent GPT model training. The first threshold may be obtained through historical data or a knowledge base;
Through the steps, the performance of the data labeling process and the model can be continuously optimized. The iterative process is helpful to improve the accuracy and efficiency of the classified retrieval method, so that the requirements of users are better met.
In summary, the classification retrieval method based on the GPT large model carries out comprehensive evaluation on the labeled training data by establishing a data evaluation model. By calculating multiple evaluation indexes and threshold comparison, the diversity and quality of the data can be objectively measured. The method is helpful for optimizing the labeling and training process and improving the accuracy and efficiency of classified retrieval.
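Because the published text lists the quantities that enter the evaluation score P (per-label training amounts and their spread, per-label accuracy, and the co-occurrence frequencies F, F_max, F_min with weights w1, w2, w3 and coefficient α) but does not reproduce the formula itself, the sketch below shows only one illustrative way such a score could be assembled from those quantities; the weighted combination, default weights and guard terms are assumptions.

```python
# Illustrative evaluation score combining label balance, labeling accuracy and
# label co-occurrence spread; the combination and defaults are assumptions.
from itertools import combinations
from statistics import median

def evaluation_score(label_counts, label_accuracy, co_occurrence,
                     w1=0.4, w2=0.4, w3=0.2, alpha=0.5):
    """label_counts: dict label -> training data amount N_i (at least 2 labels)
    label_accuracy: dict label -> labeling accuracy of that label
    co_occurrence: dict (label_a, label_b) -> joint occurrence frequency F."""
    counts = list(label_counts.values())
    n_median = median(counts)
    # Balance term: penalize the largest pairwise difference ΔN relative to the median.
    max_delta = max(abs(a - b) for a, b in combinations(counts, 2))
    balance = 1.0 - alpha * max_delta / max(n_median, 1)
    # Accuracy term: mean labeling accuracy over all labels.
    accuracy = sum(label_accuracy.values()) / len(label_accuracy)
    # Co-occurrence term: spread between F_max and F_min, normalized by F_max.
    f_values = list(co_occurrence.values())
    f_max, f_min = max(f_values), min(f_values)
    co_term = 1.0 - abs(f_max - f_min) / max(f_max, 1e-9)
    return w1 * balance + w2 * accuracy + w3 * co_term

# If the score falls below the preset first threshold, training data is added
# or re-collected, as described above.
```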
The technical scheme has the effects that: the data evaluation model can comprehensively evaluate the marked training data, and ensure the accuracy and diversity of the data. If the evaluation score is lower than the preset threshold, the training data can be increased or acquired again, so that the overall quality of the data is guaranteed.
The index in the data evaluation model may reflect the tag distribution of the data. By adjusting and optimizing the label distribution, the representativeness of the training data can be improved, and the accuracy and the efficiency of classified retrieval can be improved.
The creation of the data evaluation model may facilitate iterative optimization. By means of weight adjustment and threshold comparison of different indexes, the data labeling and training process can be continuously optimized, and the performance and accuracy of the classification retrieval method are improved.
The data evaluation model can be built to reduce the data quantity to be marked and reduce the marking cost. Meanwhile, the frequency of re-acquiring or increasing training data can be reduced by optimizing the data quality and the label distribution, so that the cost is further reduced.
Based on high-quality training data, the GPT large model can learn and understand data characteristics better, and the accuracy of classification retrieval is improved. Meanwhile, through iterative optimization, the performance of the model can be continuously improved, and the retrieval accuracy is improved.
The establishment of the data evaluation model can help to promote generalization capability. Through evaluation and analysis of different indexes, the distribution and the characteristics of the data can be better known, so that the model is better adapted to new data and new scenes.
In summary, the classification retrieval method based on the GPT large model carries out comprehensive evaluation on the labeled training data by establishing the data evaluation model, so that the method can bring the advantages and effects of guaranteeing the data quality, optimizing the label distribution, carrying out iterative optimization, reducing the cost, improving the retrieval accuracy, enhancing the generalization capability and the like. The method is beneficial to improving the overall performance and accuracy of the information retrieval field and provides strong support for various application scenes.
In the classification retrieval method based on the GPT large model of this embodiment, acquiring the data to be retrieved and inputting the data to be retrieved into the first fine-tuned GPT model to obtain the retrieval result comprises:
establishing a semantic vector space by calculating the similarity between texts on the basis of the first fine-tuned GPT model;
acquiring a text to be retrieved, inputting the text to be retrieved into the first fine-tuned GPT model, and obtaining the semantic representation vector corresponding to the text to be retrieved as a vector to be retrieved; matching the vector to be retrieved against the text vectors in the semantic vector space by similarity to obtain a retrieval result; and
sorting and displaying the retrieval results according to the classification labels and a predetermined rule.
The working principle of this technical solution is as follows: based on the first fine-tuned GPT model, a semantic vector space is established by calculating the similarity between texts. This converts text into vector form so that semantic relationships between texts can be quantified and compared.
When a text to be retrieved is obtained, it is input into the first fine-tuned GPT model, and the corresponding semantic representation vector, also called the vector to be retrieved, is computed. This converts the text to be retrieved into a form that can be compared with the vectors in the semantic vector space.
The vector to be retrieved is then matched against the text vectors in the semantic vector space to determine the texts most similar to it. Similarity matching is based on measures such as the cosine similarity or Euclidean distance between vectors, and the most similar texts are returned as the retrieval result.
Finally, the retrieval results are sorted and displayed according to the classification labels and a predetermined rule. The sorting basis may be the similarity score, the weight of the classification label or other relevant factors, and the display format, such as a list, cards or a chart, can be designed according to actual needs.
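A minimal sketch of this matching step is shown below. The `embed` callable stands in for whatever encoder the first fine-tuned GPT model exposes and is assumed to return a fixed-size NumPy vector per text; the top-k selection and the grouping by classification label are illustrative choices.

```python
# Illustrative cosine-similarity retrieval over a precomputed semantic vector
# space; "embed" is an assumed text-to-vector encoder, not a specific API.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def retrieve(query: str, corpus_vectors, embed, top_k: int = 10):
    """corpus_vectors: list of (doc_id, label, vector) built offline.
    embed: callable mapping a text to its semantic representation vector."""
    query_vec = embed(query)
    scored = [(doc_id, label, cosine_similarity(query_vec, vec))
              for doc_id, label, vec in corpus_vectors]
    scored.sort(key=lambda item: item[2], reverse=True)
    top_hits = scored[:top_k]
    # Group the hits by classification label and order groups by their best score.
    by_label = {}
    for doc_id, label, score in top_hits:
        by_label.setdefault(label, []).append((doc_id, score))
    return sorted(by_label.items(),
                  key=lambda kv: max(score for _, score in kv[1]),
                  reverse=True)
```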
The technical scheme has the effects that: and the GPT model is utilized to carry out text similarity matching, so that the most relevant result with the text to be searched can be quickly searched. Compared with the traditional retrieval method based on keyword matching, the matching based on the semantic vector space is more efficient, and related content can be quickly found in mass data.
The GPT model is fine-tuned, so that text data can be better understood and processed, and the classified retrieval method based on the GPT model can be more accurately matched with text similarity, and accuracy of retrieval results is improved. This helps the user to find the desired information faster, improving the search satisfaction.
The classification retrieval method based on the GPT large model can process various types of text data, including long text, short text, sentences and the like. The method has wider applicability, can be applied to different fields and scenes, and meets the requirements of different users.
By creating a semantic vector space, the method can convert text into a vector form so that similarity relationships between the text can be quantified and interpreted. This helps the user understand the source and basis of the search result, increasing the trustworthiness of the method.
The classification retrieval method based on the GPT large model can be used for carrying out fine tuning and optimization according to different requirements, such as adjusting model parameters, improving a feature extraction method and the like. The method has higher flexibility, and can be adjusted and improved according to actual conditions so as to obtain better performance and effect.
In summary, the classification search method based on the GPT large model can bring benefits and effects of high efficiency, accuracy, generalization capability, interpretability, flexibility and the like by performing text similarity matching and classification search by using the GPT model after the first fine tuning. This helps to improve the overall performance and user experience in the field of information retrieval.
In the classification retrieval method based on the GPT large model of this embodiment, acquiring the user feedback comprises the following steps:
setting a first preset time period; acquiring the search content for which the number of repetitions of the same search by any user within the first preset time period exceeds the preset repetition threshold; and taking the corresponding search content as first feedback;
the first preset time period is determined as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition threshold, with K_y ≥ 3; and β is the current network congestion coefficient;
setting a second preset time period; obtaining the negative feedback of each user on same-type searches within the second preset time period, wherein same-type searches are sets of searches whose search-result similarity is greater than or equal to a preset similarity value;
the second preset time period may be, for example, one day or one week;
counting, within the second preset time period, the number of negative feedbacks for each type of search among the plurality of same-type searches;
if the number of negative feedbacks for a certain type of search within the second preset time period exceeds a second threshold, taking the search content of the corresponding type as second feedback for fine-tuning the GPT model;
wherein the second threshold is Y:
C_Z is the total number of searches of that type within the second preset time period; F_y is the preset ratio of the number of negative feedbacks for a single category to the total number of searches for the corresponding single category; and C_y is the preset total number of searches for a single category within the second preset time period.
The working principle of the technical scheme is as follows: and calculating a first preset time period T according to parameters such as a user history retrieval record, a network congestion coefficient and the like. The first preset time period is used for screening out keywords which are repeatedly searched by a user in a period of time so as to acquire search contents of which the repeated times of the same search exceeds a preset time threshold.
And in a first preset time period, the system monitors the searching behavior of the user and records the repetition times of the user to the same searching content. And when the repetition number of a certain search content exceeds a preset number threshold, taking the search content as first feedback.
The second preset time period is to obtain negative feedback of the user on the same type of search. This time period may be daily or weekly, set as desired, to facilitate statistics and analysis of the user's feedback by the system. The system monitors the search behavior of the user within a second preset time period, and particularly focuses on search sets with search result similarity greater than or equal to a preset similarity value. For these homogeneous searches, the system will count the number of negative feedback per category.
And if the number of negative feedback times of a certain category exceeds a second threshold value, performing GPT model fine adjustment by taking the search content of the corresponding category as second feedback. The second threshold is calculated from the total number of searches and the ratio of the number of negative feedback to the number of searches for that category.
According to the first feedback and the second feedback, the system can conduct fine adjustment on the GPT model so as to optimize the performance of the model and improve the accuracy and efficiency of classification retrieval. The fine tuning may be performed by adjusting model parameters, improving feature extraction methods, and the like.
In summary, by setting the first and second preset time periods, the classification retrieval method based on the GPT large model obtains users' repeated search content and negative feedback over a period of time and fine-tunes the GPT model accordingly, thereby improving classification retrieval performance and user experience. The method can continuously optimize the model according to users' actual needs and feedback and raise the overall level of information retrieval.
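The text above defines the quantities entering the first preset time period T (L, T_Z, L_Z, K_y ≥ 3, β) and the second threshold Y (C_Z, F_y, C_y) but does not reproduce the formulas themselves, so the sketch below shows only one plausible way such adaptive thresholds could be computed from those inputs; both expressions are assumptions, not the patented formulas.

```python
# Illustrative adaptive thresholds built from the quantities named above;
# the functional forms are assumptions, not the patent's formulas.

def first_time_period(L, T_Z, L_Z, K_y, beta):
    """Assumed form: allow K_y repetitions of a search whose current result has
    length L, scaled by the historical time per unit of result length and
    inflated by the current network congestion coefficient beta."""
    if K_y < 3:
        raise ValueError("K_y is defined with K_y >= 3")
    return K_y * T_Z * (L / max(L_Z, 1)) * (1 + beta)

def second_threshold(C_Z, F_y, C_y):
    """Assumed form: the tolerated negative-feedback count for a category is the
    preset ratio F_y applied to the larger of the observed search count C_Z and
    the preset per-category search count C_y."""
    return F_y * max(C_Z, C_y)

# Example: a category searched 400 times in the second period, with a preset
# ratio of 0.1 calibrated on 300 searches, tolerates up to 40 negative
# feedbacks before its search content is used as second feedback for fine-tuning.
print(second_threshold(C_Z=400, F_y=0.1, C_y=300))
```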
The technical scheme has the effects that: by acquiring and analyzing repeated search content and negative feedback of a user, the GPT model can be finely adjusted, and the performance and accuracy of classification retrieval are optimized. This helps to improve the satisfaction of the user's search results, meeting the actual demands of the user. In general, users typically do not search for the same problem repeatedly in a short period of time unless the previous search results are perceived as unsatisfied or information is not accurate enough. This act of repeating the search may be seen as a user's dissatisfaction or uncertainty with respect to the current search results.
If the user repeatedly searches for the same content in a short time, this may mean that the previous search result is perceived as having an insufficient amount of information or an insufficient quality. Users want to obtain more comprehensive and accurate information through multiple searches.
In the field of information retrieval, the needs and behavior patterns of users may change over time. Thus, timely collection and analysis of user data is critical to understanding the current needs and behavior patterns of the user. In a preset time period, the system can collect behavior data of the user more completely, and omission or deviation of the data is avoided. Meanwhile, because the data in the time period are relatively concentrated, more accurate analysis and processing can be performed, and the quality and reliability of the data are improved. The preset time period helps to ease the pressure of data processing and analysis by processing the data in batches. For large-scale data sets, setting the time period may make data processing more efficient and controllable, preventing data overload or processing delay. By setting the preset time period, the system can collect behavior data of the user in a specific time, which helps to ensure timeliness of the data.
This repeated search behavior provides basis for model optimization. When the system detects that the number of repetitions of a search exceeds a preset threshold, this can be seen as a signal indicating that fine tuning or optimization of the model in the field is required in order to provide more accurate search results that meet the user's needs.
Based on the user's feedback (i.e., repeated searches), the system can dynamically adjust and optimize the model to better meet the user's actual needs. This feedback mechanism helps the system to learn and advance continuously, better adapting to the user's search habits and preferences.
By analyzing the repetitive search behavior of the user, the system may also provide more personalized services to the user. For example, for users who search repeatedly, the system may provide them with more detailed, more accurate information, or recommend related resources or links to help the user find the desired content faster.
Through user feedback, the parameters and feature extraction methods of the GPT model can be continuously optimized and its classification and retrieval capabilities improved, which helps raise the overall technical level of the information retrieval field and promotes its development.
According to the feedback and the behavior of the user, the requirements and the preferences of different users can be personalized, and more accurate search results meeting the requirements of the users can be provided. This helps to improve the user's search experience, increasing the user's confidence and dependence on the system.
Based on the processing mode of the preset time period, the feedback of the user can be obtained in real time and the fine adjustment of the model can be carried out, so that the model can adapt to and respond to the user demands and market changes in time. This helps to improve the real-time and self-adaptability of the system, and improves the competitiveness and market response capability of the system.
By setting the second threshold, the system may filter out some noise and abnormal negative feedback data. Sometimes, negative feedback from the user may be due to certain specific reasons (e.g., network problems, personal emotions, etc.), rather than problems with the model itself. The threshold may help the system eliminate these inaccurate feedback, improving the effectiveness and accuracy of the feedback. Setting the second threshold may reduce the number of search categories that need to be trimmed. If all of the homogeneous searches are fine-tuned, the computational burden on the system may be increased. By setting the threshold, the system only carries out fine adjustment on the categories with negative feedback times exceeding the threshold, so that the model can be optimized more pertinently, and the efficiency is improved. Frequent fluctuations of the model may be caused if each negative feedback is fine-tuned. By setting the threshold, the system can maintain the stability of the model while ensuring that the feedback is processed. This avoids the problem of reduced or unstable model performance due to frequent adjustments. By setting the second threshold, the system can fine tune the truly problematic search category. Thus, the accuracy and the effectiveness of feedback can be ensured, and the situation of misjudgment or excessive adjustment is avoided.
By acquiring feedback and behavior data of the user, interaction and participation of the user and the system can be increased, and loyalty and viscosity of the user are improved. Meanwhile, the participation and contribution of the users can promote the development and perfection of the system, and a virtuous circle is formed.
In summary, by acquiring user feedback and processing it accordingly, the classification retrieval method based on the GPT large model improves user satisfaction and model performance, personalizes the search experience, enables real-time adjustment and optimization, and strengthens user participation and interaction, all of which help improve the overall performance of information retrieval and provide users with a better search experience and service.
The embodiment provides a classification retrieval system based on a GPT large model, which comprises:
a preprocessing module, configured to preprocess original data to obtain preprocessed data, the preprocessed data comprising preprocessed training data and preprocessed data to be retrieved;
a labeling module, configured to label the preprocessed training data;
an evaluation module, configured to establish a data evaluation model, evaluate the labeled training data to obtain an evaluation result, and optimize the training data according to the evaluation result to obtain optimized training data;
a first fine-tuning module, configured to perform a first fine-tuning of a pre-constructed GPT model with the optimized training data;
a classification retrieval module, configured to acquire data to be retrieved and input the data to be retrieved into the first fine-tuned GPT model to obtain a retrieval result; and
a feedback optimization module, configured to acquire user feedback and further optimize the first fine-tuned GPT model according to the user feedback, wherein the user feedback includes:
search content for which the number of repetitions of the same search by any user within a first preset time period exceeds a preset repetition threshold; and
the corresponding search content when the number of negative feedbacks from a plurality of users for the same type of search within a second preset time period exceeds a second threshold.
The working principle of the technical scheme is as follows: this step includes the collection, cleaning and sorting of large amounts of raw data in order to remove extraneous or erroneous information, ensuring the quality and usability of the data. The preprocessed data will be used for subsequent training and retrieval processes.
After preprocessing, the data needs to be annotated. Labeling refers to the process of adding semantic information to data so that the GPT model can understand and learn. The annotated data will be used to train the GPT model.
By establishing a data evaluation model, the quality evaluation can be carried out on the marked data. The evaluation model can be evaluated according to factors such as diversity, accuracy and representativeness of data. According to the evaluation result, the data can be further processed and optimized to improve the overall quality of the training data.
And fine tuning the pre-constructed GPT model by using the optimized training data, so that the GPT model can be better adapted to specific tasks and data distribution. The fine tuning process includes adjusting and optimizing parameters of the model to improve its performance in the class retrieval task.
When new data to be retrieved exists, the new data to be retrieved is input into a first fine-tuned GPT model, and the model can perform reasoning and analysis according to the learned semantic information and structure to generate a corresponding retrieval result. These results may be highly correlated with the data to be retrieved, or may have some correlation with it.
In order to further improve the accuracy and efficiency of the search, the system collects feedback of the user on the search result. The user feedback may include data on the number of clicks, dwell time, repetition, etc. for a certain search result. And meanwhile, the negative feedback times of the user on the similar search are also collected. According to the feedback information, the system can further optimize the first fine-tuned GPT model so as to better meet the requirements of users.
The technical scheme has the effects that: the quality and the readability of the data can be greatly improved through preprocessing steps such as removing noise, word segmentation, removing stop words and the like. This is critical to the accuracy of the subsequent data labeling, model training, and retrieval results. Marking the preprocessed training data, and establishing a data evaluation model to evaluate the marked data, so that the reliability and quality of the training data can be ensured. The data optimization is carried out according to the evaluation result, so that the accuracy of training data can be further improved, and powerful guarantee is provided for model training. By fine tuning the pre-constructed GPT model by using the optimized training data, the classification retrieval performance of the model can be improved. The fine tuning process can be adjusted according to specific requirements of a specific field, so that the model is better adapted to data distribution and retrieval requirements of the specific field. Obtaining user feedback and further optimizing the model is an important element of the method. By analyzing the repetition times of the same search and the negative feedback times of the same search, the requirements and the behavior patterns of the user can be deeply known, so that the model is pertinently optimized, and the accuracy of classified search and the satisfaction degree of the user are improved. The method can timely acquire user feedback and dynamically adjust model parameters, so that the classification retrieval result meets the real-time requirements of the user. At the same time, the dynamic adjustment mechanism is also helpful to deal with the change of data distribution and the change of user behavior mode, and keeps the continuous optimization and updating of the model. By analyzing the user behavior data and acquiring feedback, the method can provide more personalized search service for the user. For example, based on the user's search history and preferences, the system may recommend content of related fields or topics to meet the user's personalized needs. Such personalized services help to improve user satisfaction and loyalty.
The classification retrieval system based on the GPT large model in the embodiment, wherein the labeling module comprises:
randomly transmitting the preprocessed data to a background terminal for first labeling;
Acquiring first labeling data of a background terminal; inputting the first annotation data as a training set into an annotation training model;
Performing second labeling on the residual data through a labeling training model;
randomly transmitting the second labeling data to a plurality of background terminals for inspection; obtaining a plurality of test results; wherein each test result includes a plurality of different indicators;
Averaging each index over the plurality of test results, and comparing the average with the preset threshold value of the corresponding index to obtain a first comparison result;
And adjusting and optimizing the labeling training model according to the first comparison results.
The working principle of this technical scheme is as follows: the preprocessed data are randomly transmitted to a background terminal for preliminary labeling, which is the first step of the labeling process and aims to add preliminary semantic information to the data as a basis for subsequent model training. The preliminary annotation data, referred to as the first annotation data, are obtained from the background terminal, used as a training set, and input into the labeling training model; the labeling training model can be trained with machine learning, deep learning and similar techniques. The remaining data are then labeled with the trained labeling training model, so that all data are labeled accurately and comprehensively and the model's training dataset is enriched. The second labeled data are randomly transmitted to a plurality of background terminals for inspection, in order to evaluate the accuracy and consistency of the labeling; each background terminal returns a test result containing several different indexes, such as labeling error rate and consistency. Each index is averaged over the background terminals and compared with the preset threshold of the corresponding index, yielding the first comparison result; this step ensures the reliability and accuracy of the annotation data. The labeling training model is then adjusted and optimized according to the first comparison results; such adjustments may involve the model's parameters, structure or training strategy, and aim to improve the model's performance and accuracy in the subsequent classification retrieval task. Through these steps, the data labeling process and the model's performance can be continuously optimized; this iteration helps to improve the accuracy and efficiency of the classification retrieval method and better meet users' needs.
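A minimal sketch of this label-bootstrapping and inspection loop is given below, assuming a TF-IDF plus logistic-regression classifier as the labeling training model and assuming the index names error_rate and consistency together with their thresholds; none of these choices is prescribed by this embodiment.

# Sketch of the labeling workflow: seed labels -> train labeling model ->
# label the remainder -> average inspection indices -> compare with thresholds.
# The classifier choice, index names and threshold values are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
import numpy as np

def bootstrap_labeling(seed_texts, seed_labels, remaining_texts):
    labeling_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    labeling_model.fit(seed_texts, seed_labels)               # first (manual) labels as training set
    second_labels = labeling_model.predict(remaining_texts)   # second labeling of the remaining data
    return labeling_model, second_labels

def passes_inspection(terminal_reports, thresholds):
    # terminal_reports: list of dicts, one per background terminal,
    # e.g. {"error_rate": 0.04, "consistency": 0.93}
    for index_name, limit in thresholds.items():
        avg = np.mean([report[index_name] for report in terminal_reports])
        if index_name == "error_rate" and avg > limit:
            return False                                      # too many labeling errors
        if index_name == "consistency" and avg < limit:
            return False                                      # terminals disagree too much
    return True

# Example comparison against preset thresholds (values are assumptions):
# passes_inspection(reports, {"error_rate": 0.05, "consistency": 0.90})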
The technical scheme has the following effects: multiple rounds of labeling, evaluation and optimization ensure the labeling quality of the training data; the method identifies and classifies the data more accurately and reduces labeling errors and omissions, thereby improving the accuracy and reliability of subsequent model training. Randomly distributing the data to background terminals for labeling and automatically labeling the remaining data with the labeling training model markedly improves labeling efficiency; the method can process large amounts of data quickly, shortening labeling time, lowering cost and accelerating the development and deployment of the whole classification retrieval system. By collecting user feedback on the retrieval results, the model can be adjusted and optimized in real time, improving retrieval accuracy and user satisfaction; the method can adapt dynamically to user demands, strengthening the model's adaptive capability and providing more personalized and accurate information retrieval services. Taken together, the preprocessing, labeling, evaluation and optimization steps make the GPT-large-model-based classification retrieval method more accurate, efficient and reliable as a whole; it can meet users' demands for diversified, real-time information retrieval and provides strong support for a variety of application scenarios. In summary, through multiple rounds of labeling, evaluation and optimization combined with dynamic adjustment based on user feedback, the classification retrieval method based on the GPT large model significantly improves the quality, efficiency, accuracy and overall performance of data labeling; it has broad application prospects and value and can bring important innovation and breakthroughs to the field of information retrieval.
The classifying and retrieving system based on the GPT large model in the embodiment, wherein the evaluating module comprises:
Establishing a data evaluation model, and evaluating the marked data through the evaluation model;
The data evaluation model is as follows:
P is the evaluation model score; N_i is the amount of training data for each label, and m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels, and n_Z is the minimum of those amounts; M is the total number of preset labels; the accuracy of label Z is also taken into account; α is a coefficient in the range (0, 1); w1, w2 and w3 are weights;
f is the frequency with which any two different labels occur together in the same document or data point;
F_max is the maximum value of the frequency with which any two different labels occur together in the same document or data point;
f_min is the minimum value of the frequency with which any two different labels occur together in the same document or data point; abs() denotes the absolute value;
if P is lower than the preset first threshold value, training data is increased or acquired again.
The working principle of the technical scheme is as follows: firstly, a data evaluation model is required to be established for objectively and comprehensively evaluating the marked training data. This evaluation model is intended to measure the diversity and representativeness of the training data and whether it can support subsequent GPT model training.
And calculating a plurality of evaluation indexes such as label accuracy, label quantity difference, median value and minimum value of label distribution and the like through a model formula. These indicators provide detailed information about the quality of the data, helping to assess the overall quality of the annotated data.
The preset first threshold is a key parameter used for judging the qualification standard of the marked data. If the evaluation model score P is below this threshold, it is indicated that the quality of the current training data is not satisfactory and that more training data needs to be added or acquired again.
If the evaluation model score P is lower than a preset first threshold, data adjustment and optimization are needed. This may include re-labeling portions of the data, adding new training data, adjusting tag distribution, etc. to ensure that the diversity and quality of the training data meets the needs of subsequent GPT model training. The first threshold may be obtained through historical data or a knowledge base;
Through the steps, the performance of the data labeling process and the model can be continuously optimized. The iterative process is helpful to improve the accuracy and efficiency of the classified retrieval method, so that the requirements of users are better met.
In summary, the classification retrieval method based on the GPT large model carries out comprehensive evaluation on the labeled training data by establishing a data evaluation model. By calculating multiple evaluation indexes and threshold comparison, the diversity and quality of the data can be objectively measured. The method is helpful for optimizing the labeling and training process and improving the accuracy and efficiency of classified retrieval.
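The exact evaluation expression and its weights are not reproduced here; purely as an illustration of how the factors listed above (label-count balance, labeling accuracy and label co-occurrence spread) can be combined into a single score P and compared with the first threshold, one could proceed roughly as follows, with the weights and the way the terms are combined being assumptions.

# Illustrative-only scoring of labeled training data; the weights and the way the
# factors are combined are assumptions, not the evaluation formula of this embodiment.
from collections import Counter
from itertools import combinations
import numpy as np

def evaluate_labeled_data(labels, accuracies, cooccurrence, w1=0.4, w2=0.3, w3=0.3, alpha=0.5):
    # labels: list of label ids, one per training sample
    # accuracies: dict label -> spot-checked labeling accuracy in [0, 1]
    # cooccurrence: dict {(label_a, label_b): frequency of appearing in the same document}
    counts = Counter(labels)
    n = np.array(list(counts.values()), dtype=float)

    # Balance term: penalize large gaps between label counts (median as the anchor).
    balance = 1.0 - (n.max() - n.min()) / max(np.median(n), 1.0)

    # Accuracy term: average spot-checked accuracy over the labels present.
    accuracy = float(np.mean([accuracies[lbl] for lbl in counts]))

    # Co-occurrence term: normalized spread of pairwise co-occurrence frequencies.
    freqs = [cooccurrence.get(pair, 0) for pair in combinations(sorted(counts), 2)]
    spread = 0.0
    if freqs and max(freqs) > min(freqs):
        spread = abs(max(freqs) - min(freqs)) / max(freqs)
    diversity = 1.0 - alpha * spread

    return w1 * balance + w2 * accuracy + w3 * diversity

# if evaluate_labeled_data(...) < FIRST_THRESHOLD: add or re-acquire training data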
The technical scheme has the effects that: the data evaluation model can comprehensively evaluate the marked training data, and ensure the accuracy and diversity of the data. If the evaluation score is lower than the preset threshold, the training data can be increased or acquired again, so that the overall quality of the data is guaranteed.
The index in the data evaluation model may reflect the tag distribution of the data. By adjusting and optimizing the label distribution, the representativeness of the training data can be improved, and the accuracy and the efficiency of classified retrieval can be improved.
The creation of the data evaluation model may facilitate iterative optimization. By means of weight adjustment and threshold comparison of different indexes, the data labeling and training process can be continuously optimized, and the performance and accuracy of the classification retrieval method are improved.
Building the data evaluation model can reduce the amount of data that needs to be labeled and lower the labeling cost. Meanwhile, optimizing data quality and label distribution can reduce how often training data must be re-acquired or supplemented, further reducing cost.
Based on high-quality training data, the GPT large model can learn and understand data characteristics better, and the accuracy of classification retrieval is improved. Meanwhile, through iterative optimization, the performance of the model can be continuously improved, and the retrieval accuracy is improved.
Building the data evaluation model also helps to improve generalization capability. Through evaluation and analysis of the different indexes, the distribution and characteristics of the data can be better understood, so that the model adapts better to new data and new scenes.
In summary, the classification retrieval method based on the GPT large model carries out comprehensive evaluation on the labeled training data by establishing the data evaluation model, so that the method can bring the advantages and effects of guaranteeing the data quality, optimizing the label distribution, carrying out iterative optimization, reducing the cost, improving the retrieval accuracy, enhancing the generalization capability and the like. The method is beneficial to improving the overall performance and accuracy of the information retrieval field and provides strong support for various application scenes.
The classified retrieval system based on the GPT large model in the embodiment, wherein the classified retrieval module comprises:
Based on the first fine-tuned GPT model, a semantic vector space is established by calculating the similarity between texts;
Acquiring a text to be searched, inputting the text to be searched into the first fine-tuned GPT model, and acquiring a semantic representation vector corresponding to the text to be searched as a vector to be searched; matching the similarity between the vector to be searched and the text vector in the semantic vector space to obtain a search result;
And according to the classification labels, the search results are ordered and displayed according to a certain rule.
The working principle of the technical scheme is as follows: based on the first trimmed GPT model, a semantic vector space can be established by calculating the similarity between the texts. This process converts text into vector form so that semantic relationships between the text can be quantified and compared.
When the text to be searched is obtained, the text to be searched is input into a first fine-tuned GPT model, and a semantic representation vector corresponding to the text to be searched, which is also called a vector to be searched, is obtained through model calculation. This process converts the text to be retrieved into a form that is comparable to vectors in the semantic vector space.
The similarity between the vector to be searched and the text vectors in the semantic vector space is then matched to determine the texts most similar to the text to be searched. Similarity matching is based on measures such as the cosine similarity or the Euclidean distance between vectors: by computing the similarity between the vector to be searched and the vectors in the semantic vector space, the most similar texts are found and returned as the search result.
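For reference, the cosine similarity between the vector to be searched q and a stored text vector d, and the corresponding Euclidean distance, are the standard measures:

$$ \mathrm{sim}(q,d)=\frac{q\cdot d}{\lVert q\rVert\,\lVert d\rVert}, \qquad \mathrm{dist}(q,d)=\lVert q-d\rVert_2 $$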
And according to the classification labels, the search results are ordered and displayed according to a certain rule. The basis for ranking may be a similarity score, a weight of the classification tags, or other relevant factors. The format of the display can be designed according to the actual requirements, such as a list, a card or a chart.
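A minimal sketch of this retrieval and ranking step is shown below; it assumes the first fine-tuned model is loaded from an assumed local directory and that mean pooling of the last hidden states is used as the semantic representation, both of which are illustrative choices rather than requirements of this embodiment.

# Sketch: build a semantic vector space from the fine-tuned model, then retrieve
# by cosine similarity and order the results by score. The output directory and
# the mean-pooling choice are assumptions.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt_first_finetune")   # assumed fine-tune output dir
model = AutoModel.from_pretrained("gpt_first_finetune")

def embed(text: str) -> np.ndarray:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # shape (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()          # mean pooling over tokens

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def retrieve(query: str, corpus_vectors, corpus_texts, corpus_labels, top_k=10):
    q = embed(query)                                      # vector to be searched
    scored = [(cosine(q, v), text, label)
              for v, text, label in zip(corpus_vectors, corpus_texts, corpus_labels)]
    scored.sort(key=lambda item: item[0], reverse=True)   # order by similarity score
    return scored[:top_k]                                 # (score, text, classification label)

The returned classification labels can then be used to group or re-rank the displayed results according to whatever rule the application requires.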
The technical scheme has the effects that: and the GPT model is utilized to carry out text similarity matching, so that the most relevant result with the text to be searched can be quickly searched. Compared with the traditional retrieval method based on keyword matching, the matching based on the semantic vector space is more efficient, and related content can be quickly found in mass data.
Because the GPT model has been fine-tuned, it understands and processes the text data better, so the GPT-based classification retrieval method matches text similarity more accurately and the accuracy of the retrieval results improves. This helps the user find the desired information faster and improves search satisfaction.
The classification retrieval method based on the GPT large model can process various types of text data, including long text, short text, sentences and the like. The method has wider applicability, can be applied to different fields and scenes, and meets the requirements of different users.
By creating a semantic vector space, the method can convert text into a vector form so that similarity relationships between the text can be quantified and interpreted. This helps the user understand the source and basis of the search result, increasing the trustworthiness of the method.
The classification retrieval method based on the GPT large model can be used for carrying out fine tuning and optimization according to different requirements, such as adjusting model parameters, improving a feature extraction method and the like. The method has higher flexibility, and can be adjusted and improved according to actual conditions so as to obtain better performance and effect.
In summary, the classification search method based on the GPT large model can bring benefits and effects of high efficiency, accuracy, generalization capability, interpretability, flexibility and the like by performing text similarity matching and classification search by using the GPT model after the first fine tuning. This helps to improve the overall performance and user experience in the field of information retrieval.
The classified retrieval system based on the GPT large model in the embodiment, wherein the obtaining of the user feedback comprises the following steps:
setting a first preset time period; acquiring search content of which the repetition number of the same search by any user exceeds a preset number threshold value in a first preset time period; taking the corresponding search content as first feedback;
the first preset time period is as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition-number threshold, with K_y ≥ 3; β is the current network congestion coefficient;
Setting a second preset time period; obtaining negative feedback of each user on similar searches in a second preset time period, wherein the similar searches are search sets with search result similarity being greater than or equal to a preset similarity value;
The second preset time period may be, for example, one day or one week;
counting the number of negative feedback of each type of search in the plurality of similar searches in a second preset time period;
If the number of times of searching negative feedback of a certain type in the second preset time period exceeds a second threshold value, performing GPT model fine adjustment by taking the searching content of the corresponding type as second feedback;
Wherein the second threshold is Y:
C_Z is the total number of searches of the given category within the second preset time period; F_y is the preset ratio of the number of negative-feedback events for a single category to the total number of searches of that category; C_y is the preset total number of searches of a single category within the second preset time period.
The working principle of this technical scheme is as follows: the first preset time period T is calculated from parameters such as the user's historical retrieval records and the network congestion coefficient. The first preset time period is used to screen out keywords that the user searches repeatedly within a period of time, so as to obtain search content whose number of repetitions of the same search exceeds the preset number threshold.
And in a first preset time period, the system monitors the searching behavior of the user and records the repetition times of the user to the same searching content. And when the repetition number of a certain search content exceeds a preset number threshold, taking the search content as first feedback.
The second preset time period is to obtain negative feedback of the user on the same type of search. This time period may be daily or weekly, set as desired, to facilitate statistics and analysis of the user's feedback by the system. The system monitors the search behavior of the user within a second preset time period, and particularly focuses on search sets with search result similarity greater than or equal to a preset similarity value. For these homogeneous searches, the system will count the number of negative feedback per category.
And if the number of negative feedback times of a certain category exceeds a second threshold value, performing GPT model fine adjustment by taking the search content of the corresponding category as second feedback. The second threshold is calculated from the total number of searches and the ratio of the number of negative feedback to the number of searches for that category.
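A sketch of this feedback-collection logic is given below. Because the exact expressions for the first preset time period T and the second threshold Y are not re-derived here, both are treated as already-computed inputs; the helper names and log formats are assumptions.

# Sketch of collecting first and second feedback within the preset time windows.
# T (first window), K_y (repetition-count threshold) and Y (second threshold)
# are treated as inputs computed elsewhere.
from collections import Counter, defaultdict

def first_feedback(search_log, window_T, count_threshold_Ky):
    # search_log: list of (timestamp, user_id, query).
    # Returns queries a user repeated more than K_y times within the window T,
    # which become the first feedback.
    latest = max(t for t, _, _ in search_log)
    recent = [(u, q) for t, u, q in search_log if latest - t <= window_T]
    repeats = Counter(recent)
    return {q for (u, q), n in repeats.items() if n > count_threshold_Ky}

def second_feedback(negative_log, window_start, window_end, second_threshold_Y):
    # negative_log: list of (timestamp, category, query) negative-feedback events,
    # where "category" groups searches whose results are sufficiently similar.
    per_category = defaultdict(list)
    for t, cat, q in negative_log:
        if window_start <= t <= window_end:
            per_category[cat].append(q)
    # Categories whose negative-feedback count exceeds Y feed the next fine-tuning round.
    return {cat: qs for cat, qs in per_category.items() if len(qs) > second_threshold_Y}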
According to the first feedback and the second feedback, the system can conduct fine adjustment on the GPT model so as to optimize the performance of the model and improve the accuracy and efficiency of classification retrieval. The fine tuning may be performed by adjusting model parameters, improving feature extraction methods, and the like.
In summary, the classification search method based on the GPT big model obtains repeated search content and negative feedback of the user in a period of time by setting the first preset time period and the second preset time period, and fine-adjusts the GPT model according to the feedback, so as to improve the classification search performance and user experience. The method can continuously optimize the performance of the model according to the actual demands and feedback of the user, and improves the overall level of the information retrieval field.
The technical scheme has the effects that: by acquiring and analyzing repeated search content and negative feedback of a user, the GPT model can be finely adjusted, and the performance and accuracy of classification retrieval are optimized. This helps to improve the satisfaction of the user's search results, meeting the actual demands of the user. Generally, users will not repeatedly search for the same problem in a short period of time unless the previous search result is perceived as not meeting the user's needs or information is not accurate enough. This act of repeating the search may be seen as a user's dissatisfaction or uncertainty with respect to the current search results.
If the user repeatedly searches for the same content in a short time, this may mean that the user feels that the previous search result is insufficient in information amount or insufficient in quality. Users want to obtain more comprehensive and accurate information through multiple searches.
In the field of information retrieval, the needs and behavior patterns of users may change over time. Thus, timely collection and analysis of user data is critical to understanding the current needs and behavior patterns of the user. In a preset time period, the system can collect behavior data of the user more completely, and omission or deviation of the data is avoided. Meanwhile, because the data in the time period are relatively concentrated, more accurate analysis and processing can be performed, and the quality and reliability of the data are improved. The preset time period helps to ease the pressure of data processing and analysis by processing the data in batches. For large-scale data sets, setting the time period may make data processing more efficient and controllable, preventing data overload or processing delay. By setting the preset time period, the system can collect behavior data of the user in a specific time, which helps to ensure timeliness of the data.
This repeated search behavior provides basis for model optimization. When the system detects that the number of repetitions of a search exceeds a preset threshold, this can be seen as a signal indicating that fine tuning or optimization of the model in the field is required in order to provide more accurate search results that meet the user's needs.
Based on the user's feedback (i.e., repeated searches), the system can dynamically adjust and optimize the model to better meet the user's actual needs. This feedback mechanism helps the system to learn and advance continuously, better adapting to the user's search habits and preferences.
By analyzing the repetitive search behavior of the user, the system may also provide more personalized services to the user. For example, for users who search repeatedly, the system may provide them with more detailed, more accurate information, or recommend related resources or links to help the user find the desired content faster.
Through feedback of users, parameters and feature extraction modes of the GPT model can be continuously optimized, and classification and retrieval capabilities of the model are improved. This helps to promote the overall skill level in the information retrieval arts, and promotes the development and progress of the arts.
According to the feedback and the behavior of the user, the requirements and the preferences of different users can be personalized, and more accurate search results meeting the requirements of the users can be provided. This helps to improve the user's search experience, increasing the user's confidence and dependence on the system.
Based on the processing mode of the preset time period, the feedback of the user can be obtained in real time and the fine adjustment of the model can be carried out, so that the model can adapt to and respond to the user demands and market changes in time. This helps to improve the real-time and self-adaptability of the system, and improves the competitiveness and market response capability of the system.
By setting the second threshold, the system can filter out noise and anomalous negative-feedback data. Sometimes a user's negative feedback is caused by specific circumstances (such as network problems or personal mood) rather than by a problem with the model itself; the threshold helps the system discard such inaccurate feedback and improves the validity and accuracy of the feedback that remains. Setting the second threshold also reduces the number of search categories that need to be fine-tuned: if every similar search triggered fine-tuning, the computational burden on the system would increase, whereas with the threshold the system only fine-tunes categories whose negative-feedback count exceeds it, so the model is optimized in a more targeted and efficient way. Fine-tuning on every negative-feedback event could also cause frequent fluctuations of the model; with the threshold the system keeps the model stable while still processing feedback, avoiding performance degradation or instability caused by frequent adjustments. In short, the second threshold lets the system fine-tune only the genuinely problematic search categories, ensuring the accuracy and validity of the feedback and avoiding misjudgment or over-adjustment.
By acquiring feedback and behavior data of the user, interaction and participation of the user and the system can be increased, and loyalty and viscosity of the user are improved. Meanwhile, the participation and contribution of the users can promote the development and perfection of the system, and a virtuous circle is formed.
In summary, the classification search method based on the GPT big model can bring the benefits and effects of improving user satisfaction, improving model performance, personalizing search experience, adjusting and optimizing in real time, strengthening user participation and interaction and the like by acquiring user feedback and performing corresponding processing. These benefits and effects help to improve the overall performance and skill level of the information retrieval field, providing better search experience and service for users.
The purpose, performance, advancement and novelty of the present application have been described above in accordance with the requirements emphasized by the patent statutes. The description and the drawings, however, show only preferred embodiments of the present application; any equivalent change or modification of the construction, apparatus or features of the present application shall therefore fall within its scope of protection.

Claims (6)

1. The classification retrieval method based on the GPT large model is characterized by comprising the following steps of:
Preprocessing the original data; obtaining preprocessed data; the preprocessed data comprise preprocessed training data and preprocessed data to be retrieved;
Marking the preprocessed training data;
establishing a data evaluation model, and evaluating the marked training data to obtain an evaluation result; training data optimization according to the evaluation result; obtaining optimized training data;
Performing first fine adjustment on a pre-constructed GPT model through optimized training data;
Acquiring data to be retrieved, inputting the data to be retrieved into the first fine-tuned GPT model, and acquiring a retrieval result;
acquiring user feedback, and further optimizing the first fine-tuned GPT model according to the user feedback; the user feedback includes:
search content in which the repetition number of the same search by any user exceeds a preset number threshold in a first preset time period;
And corresponding search content when the negative feedback times of the plurality of users to the same type search exceeds a second threshold value within a second preset time;
the data evaluation model is established, and the marked training data is evaluated to obtain an evaluation result; comprising the following steps:
Establishing a data evaluation model, and evaluating the marked data through the evaluation model;
The data evaluation model is as follows:
P is the evaluation model score; N_i is the amount of training data for each label, and m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels; n_Z is the minimum of the training data amounts of the different labels; M is the total number of preset labels; the accuracy of label Z is also taken into account; α is a coefficient in the range (0, 1); w1, w2 and w3 are weights;
f is the frequency of simultaneous occurrence of any two different labels in the same document or data point;
F_max is the maximum value of the frequency with which any two different labels occur together in the same document or data point;
f_min is the minimum value of the frequency with which any two different labels occur together in the same document or data point; abs() denotes the absolute value;
If P is lower than a preset first threshold value, training data are increased or acquired again;
The obtaining user feedback includes:
setting a first preset time period; acquiring search content of which the repetition number of the same search by any user exceeds a preset number threshold value in a first preset time period; taking the corresponding search content as first feedback;
the first preset time period is as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition-number threshold; β is the current network congestion coefficient;
Setting a second preset time period; obtaining negative feedback of each user on similar searches in a second preset time period, wherein the similar searches are search sets with search result similarity being greater than or equal to a preset similarity value;
counting the number of negative feedback of each type of search in the plurality of similar searches in a second preset time period;
If the number of times of searching negative feedback of a certain type in the second preset time period exceeds a second threshold value, performing GPT model fine adjustment by taking the searching content of the corresponding type as second feedback;
Wherein the second threshold is Y:
C_Z is the total number of searches of the given category within the second preset time period; F_y is the preset ratio of the number of negative-feedback events for a single category to the total number of searches of that category; C_y is the preset total number of searches of a single category within the second preset time period.
2. The GPT large-model-based classification retrieval method according to claim 1, wherein the preprocessed training data is labeled; comprising the following steps:
randomly transmitting the preprocessed data to a background terminal for first labeling;
Acquiring first labeling data of a background terminal; inputting the first annotation data as a training set into an annotation training model;
Performing second labeling on the residual data through a labeling training model;
randomly transmitting the second labeling data to a plurality of background terminals for inspection; obtaining a plurality of test results; wherein each test result includes a plurality of different indicators;
Averaging the same index of the plurality of test results, and comparing the same index with a preset threshold value of the corresponding index to obtain a first comparison result;
And adjusting and optimizing the labeling training model according to the first comparison results.
3. The classification retrieval method based on the GPT large model according to claim 1, wherein the obtaining the data to be retrieved inputs the data to be retrieved into the first fine-tuned GPT model to obtain a retrieval result; comprising the following steps:
Based on the first fine-tuned GPT model, a semantic vector space is established by calculating the similarity between texts;
Acquiring a text to be searched, inputting the text to be searched into the first fine-tuned GPT model, and acquiring a semantic representation vector corresponding to the text to be searched as a vector to be searched; matching the similarity between the vector to be searched and the text vector in the semantic vector space to obtain a search result;
And according to the classification labels, the search results are ordered and displayed according to a certain rule.
4. A GPT large model-based classification retrieval system, the system comprising:
the preprocessing module is used for preprocessing the original data; obtaining preprocessed data; the preprocessed data comprise preprocessed training data and preprocessed data to be retrieved;
The marking module is used for marking the preprocessed training data;
The evaluation module is used for establishing a data evaluation model and evaluating the marked training data to obtain an evaluation result; training data optimization according to the evaluation result; obtaining optimized training data;
the first fine tuning module is used for carrying out first fine tuning on the pre-constructed GPT model through the optimized training data;
the classification retrieval module is used for acquiring data to be retrieved, inputting the data to be retrieved into the first fine-tuned GPT model, and obtaining a retrieval result;
the feedback optimization module is used for acquiring user feedback and further optimizing the first fine-tuned GPT model according to the user feedback; the user feedback includes:
search content in which the repetition number of the same search by any user exceeds a preset number threshold in a first preset time period;
And corresponding search content when the negative feedback times of the plurality of users to the same type search exceeds a second threshold value within a second preset time;
The evaluation module includes:
Establishing a data evaluation model, and evaluating the marked data through the evaluation model;
The data evaluation model is as follows:
P is the evaluation model score; N_i is the amount of training data for each label, and m is the total number of labels in the training data; ΔN is the difference between the training data amounts of any two labels; N_Z is the median of the training data amounts of the different labels; n_Z is the minimum of the training data amounts of the different labels; M is the total number of preset labels; the accuracy of label Z is also taken into account; α is a coefficient in the range (0, 1); w1, w2 and w3 are weights;
f is the frequency of simultaneous occurrence of any two different labels in the same document or data point;
F_max is the maximum value of the frequency with which any two different labels occur together in the same document or data point;
f_min is the minimum value of the frequency with which any two different labels occur together in the same document or data point; abs() denotes the absolute value;
If P is lower than a preset first threshold value, training data are increased or acquired again;
The obtaining user feedback includes:
setting a first preset time period; acquiring search content of which the repetition number of the same search by any user exceeds a preset number threshold value in a first preset time period; taking the corresponding search content as first feedback;
the first preset time period is as follows:
T is the first preset time period; L is the length of the information currently retrieved by the user; T_Z is the average time taken to obtain a result for each search in the user's history; L_Z is the average information length of the result obtained for each search in the user's history; K_y is the preset repetition-number threshold; β is the current network congestion coefficient;
Setting a second preset time period; obtaining negative feedback of each user on similar searches in a second preset time period, wherein the similar searches are search sets with search result similarity being greater than or equal to a preset similarity value;
counting the number of negative feedback of each type of search in the plurality of similar searches in a second preset time period;
If the number of times of searching negative feedback of a certain type in the second preset time period exceeds a second threshold value, performing GPT model fine adjustment by taking the searching content of the corresponding type as second feedback;
Wherein the second threshold is Y:
C_Z is the total number of searches of the given category within the second preset time period; F_y is the preset ratio of the number of negative-feedback events for a single category to the total number of searches of that category; C_y is the preset total number of searches of a single category within the second preset time period.
5. The GPT-large model-based classification retrieval system of claim 4, wherein the labeling module comprises:
randomly transmitting the preprocessed data to a background terminal for first labeling;
Acquiring first labeling data of a background terminal; inputting the first annotation data as a training set into an annotation training model;
Performing second labeling on the residual data through a labeling training model;
randomly transmitting the second labeling data to a plurality of background terminals for inspection; obtaining a plurality of test results; wherein each test result includes a plurality of different indicators;
Averaging the same index of the plurality of test results, and comparing the same index with a preset threshold value of the corresponding index to obtain a first comparison result;
And adjusting and optimizing the labeling training model according to the first comparison results.
6. The GPT-large-model-based class retrieval system of claim 4, wherein the class retrieval module comprises:
Based on the first fine-tuned GPT model, a semantic vector space is established by calculating the similarity between texts;
Acquiring a text to be searched, inputting the text to be searched into the first fine-tuned GPT model, and acquiring a semantic representation vector corresponding to the text to be searched as a vector to be searched; matching the similarity between the vector to be searched and the text vector in the semantic vector space to obtain a search result;
And according to the classification labels, the search results are ordered and displayed according to a certain rule.
CN202410056662.4A 2024-01-15 2024-01-15 Classification retrieval method and system based on GPT large model Active CN117891898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410056662.4A CN117891898B (en) 2024-01-15 2024-01-15 Classification retrieval method and system based on GPT large model

Publications (2)

Publication Number Publication Date
CN117891898A CN117891898A (en) 2024-04-16
CN117891898B true CN117891898B (en) 2024-06-28

Family

ID=90650316

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117370586A (en) * 2023-09-13 2024-01-09 北京达佳互联信息技术有限公司 Information display method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3531303A1 (en) * 2018-02-27 2019-08-28 Micware Co., Ltd. Information retrieval apparatus, information retrieval system, information retrieval method, and program

Also Published As

Publication number Publication date
CN117891898A (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant