CN116910275B - Form generation method and system based on large language model - Google Patents

Form generation method and system based on large language model

Info

Publication number
CN116910275B
CN116910275B (application CN202311173289.2A)
Authority
CN
China
Prior art keywords
input data
classification
semantic
classifying
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311173289.2A
Other languages
Chinese (zh)
Other versions
CN116910275A (en)
Inventor
柴亚团
黄凯凯
陈思远
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Rongzhi Technology Co ltd
Original Assignee
Wuxi Rongzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Rongzhi Technology Co ltd filed Critical Wuxi Rongzhi Technology Co ltd
Priority to CN202311173289.2A
Publication of CN116910275A
Application granted
Publication of CN116910275B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a form generation method and system based on a large language model, belonging to the technical field of data processing. The method comprises the following steps: receiving input data of a user; determining the domain ontology of the input data; calculating primary classification features of the input data within the domain ontology range of the input data; determining a classification parameter based on the feature vector corresponding to the primary classification features, and adjusting the feature vector based on the classification parameter; adopting a two-point crossover algorithm to adjust the classification parameter, secondarily classifying the classification features, and classifying the input data according to the secondary classification features to obtain multiple classes of sub-input data; performing semantic recognition on each class of sub-input data and calculating the semantic similarity after semantic recognition; when the semantic similarity is smaller than a preset similarity, generating semantic summarization words of each class of sub-input data by using the large language model, and otherwise readjusting the classification parameter and reclassifying; and combining the sub-input data corresponding to the semantic summarization words to generate a display form.

Description

Form generation method and system based on large language model
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a form generation method and system based on a large language model.
Background
Rapidly generating summarized forms from user input can increase the efficiency of text work and of review by reviewers. A form contains the necessary components of a standard document; for example, a report document includes the title, the chapters of the report, the main content of the report, the audience of the report, the date, and so on. With a template, a form that meets the standard can be produced quickly. At present, however, form templates are mainly produced manually, which means that for every text to be generated the source documents must be consulted by hand and the components of the text determined one by one before a suitable form template can be built. Especially for large numbers of documents that have no predefined form template, this approach is inefficient, time-consuming and requires a huge investment of labor.
Disclosure of Invention
The invention provides a form generation method and a form generation system based on a large language model, which aim to solve the technical problems of the low efficiency and long processing time of manually produced form templates in the prior art.
First aspect
The invention provides a form generation method based on a large language model, which comprises the following steps:
s101: receiving input data of a user;
s102: determining a domain ontology of the input data based on a domain knowledge graph, wherein the domain knowledge graph comprises WordNet, DBpedia or YAGO;
s103: within the domain ontology range of the input data, calculating primary classification features of the input data, with the classification duration being smaller than a preset duration and the category feature similarity being larger than a preset similarity as a first constraint condition;
s104: determining a classification parameter based on a feature vector corresponding to the primary classification features, and adjusting the feature vector based on the classification parameter, wherein the feature vector is a binary value corresponding to the primary classification features;
s105: adopting a two-point crossover algorithm to adjust the classification parameter, secondarily classifying the classification features with minimum classification resource consumption as a second constraint condition to obtain secondary classification features, and classifying the input data according to the secondary classification features to obtain multiple classes of sub-input data;
s106: carrying out semantic recognition on various sub-input data by using a large language model, and calculating semantic similarity after semantic recognition;
s107: if the semantic similarity is smaller than the preset similarity, entering S108, otherwise, returning to S105;
s108: generating semantic summarization words of various sub-input data by using a large language model;
s109: combining sub-input data corresponding to the semantic summarization words to generate a display form;
wherein S103 specifically includes:
s1031: calculating the classification duration:
L = M · Σ_α (T_α · R_α)
wherein L represents the classification duration, M represents the number of paragraphs of the input data, T_α represents the duration required to classify the α-th paragraph, and R_α represents the signal-to-noise ratio of the receiving end for the α-th paragraph;
s1032: calculating category feature similarity:
wherein s represents the category feature similarity, x_β represents the β-th category feature, a represents the classification space smoothing coefficient, and f represents the classification space friction coefficient;
s1033: fusing the class features with the classifying duration smaller than the preset duration and the class feature similarity larger than the preset similarity to obtain primary classifying features;
s104 specifically comprises:
s1041: the maximum posterior assumption is made for the primary classification feature:
wherein N represents the hypothesis obtained by maximum a posteriori estimation, p(c) represents the prior probability of category c, p(x) represents the probability of the observed data x, n represents the number of primary classification features, p(x|c) represents the probability of the observed data x occurring under category c, and min L represents the loss function expressed by the minimum classification duration;
s1042: calculating classification parameters by combining the maximum posterior assumption:
wherein D represents the objective function of the feature vector a, n represents the number of primary classification features, η represents the classification parameter, v represents an auxiliary vector of the feature vector (that is, an auxiliary parameter of the feature vector), and p(v|neighbor(v)) represents the occurrence probability of the auxiliary vector given the neighbor objects of the auxiliary vector;
secondarily classifying the classification features with minimum classification resource consumption as a second constraint condition to obtain secondary classification features, and classifying the input data according to the secondary classification features to obtain multiple classes of sub-input data, specifically comprises:
s1051: and carrying out secondary classification on the classification characteristics by taking minimum classification resource consumption as a second constraint condition:
wherein e represents the classification resource consumption, x'_γ represents the γ-th class of secondary classification features, α_γ represents the calculation rate for obtaining the γ-th class of secondary classification features, Y represents the periodic resource consumption in the classification process, W represents the probability weight, P_γ represents the prior probability of the γ-th class of secondary classification features, and R_γ represents the signal-to-noise ratio of the receiving end for obtaining the γ-th class of secondary classification features;
s1052: and classifying the input data according to the secondary classification characteristics to obtain multi-class sub-input data.
Second aspect
The invention provides a form generating system based on a large language model, which is used for executing a form generating method based on the large language model in the first aspect.
Compared with the prior art, the invention has at least the following beneficial technical effects:
In the invention, the domain ontology of the input data is determined in advance through the domain knowledge graph, which narrows the calculation range of the large language model and improves the accuracy of form generation while reducing the form generation time. In addition, throughout the form generation process the input data are classified multiple times by combining the classification duration and the category feature similarity, which improves the classification accuracy of the input data while ensuring classification efficiency and feasibility, and thus further improves the accuracy of form generation. After the multiple classifications, the semantic analysis capability of existing large language models is fully utilized: semantic analysis is performed on the classified sub-input data and the semantic similarity after analysis is calculated; the semantic summarization words of the form are generated only when the semantic similarity falls below a certain level, and otherwise the classification parameter is adjusted and the data are reclassified, which improves the distinguishability of the classification result and therefore of the finally generated form. Automatic form generation can greatly improve the efficiency and quality of text work, making text processing more efficient and accurate and reducing the burden of manual processing.
Drawings
The above features, technical features, advantages and implementations of the present invention will be further described below in a clear and easily understood manner with reference to the accompanying drawings and preferred embodiments.
FIG. 1 is a schematic flow chart of a form generating method based on a large language model.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will explain the specific embodiments of the present invention with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the invention, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
For simplicity, only the parts relevant to the invention are schematically shown in each drawing; they do not represent the actual structure of the product. Additionally, to simplify the drawings for ease of understanding, where several components in a drawing have the same structure or function, only one of them is schematically shown or labeled. Herein, "a" covers not only the case of "only one" but also the case of "more than one".
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In this context, it should be noted that, unless explicitly stated or limited otherwise, the terms "mounted", "connected" and "coupled" are to be construed broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediate medium, or internal communication between two elements. The specific meaning of the above terms in the present invention will be understood by those of ordinary skill in the art on a case-by-case basis.
In addition, in the description of the present invention, the terms "first," "second," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Example 1
In one embodiment, referring to fig. 1 of the specification, a schematic flow chart of a form generating method based on a large language model provided by the invention is shown.
The invention provides a form generation method based on a large language model, which comprises the following steps:
s101: input data of a user is received.
The input data may be text data, numerical data or other data commonly used in natural language processing, for example sentences, paragraphs or documents; text classification, feature extraction, semantic similarity calculation and the like are then performed on it to generate the final form.
S102: based on the domain knowledge graph, determining a domain ontology of the input data.
Wherein, the domain knowledge graph comprises WordNet, DBpedia or YAGO.
It should be noted that determining the domain ontology of the input data based on the domain knowledge graph means that, for the provided input data, the domain information related to that data is found by querying the domain knowledge graph. This helps subsequent data processing and analysis by associating the input data with domain-related information, narrowing the scope of semantic analysis and improving analysis accuracy and efficiency.
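By way of illustration only, the following minimal Python sketch shows one way such a domain lookup could be performed against WordNet; using WordNet lexicographer categories (lexname) as coarse domain labels is an assumption of the sketch, not a requirement of the method, and DBpedia or YAGO could be queried analogously.

```python
import re
from collections import Counter

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def infer_domain(text: str) -> str:
    """Return the most common WordNet lexicographer category of the tokens."""
    categories = []
    for token in re.findall(r"[a-zA-Z]+", text.lower()):
        synsets = wn.synsets(token)
        if synsets:
            # lexname() looks like "noun.communication" or "noun.artifact"
            categories.append(synsets[0].lexname())
    return Counter(categories).most_common(1)[0][0] if categories else "unknown"

print(infer_domain("The quarterly report summarizes revenue and expenses."))
```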
S103: and in the field ontology range of the input data, calculating one-time classification characteristics of the input data by taking the first constraint condition that the classification duration is smaller than the preset duration and the class characteristic similarity is larger than the preset similarity.
It should be noted that, the size of the preset duration may be set by those skilled in the art according to actual needs, and the present invention is not limited herein.
In one possible implementation, S103 specifically includes:
s1031: calculating the classification duration:
L = M · Σ_α (T_α · R_α)
wherein L represents the classification duration, M represents the number of paragraphs of the input data, T_α represents the duration required to classify the α-th paragraph, and R_α represents the signal-to-noise ratio of the receiving end for the α-th paragraph;
s1032: calculating category feature similarity:
wherein s represents the category feature similarity, x_β represents the β-th category feature, a represents the classification space smoothing coefficient, and f represents the classification space friction coefficient;
s1033: and fusing the class characteristics with the classifying duration smaller than the preset duration and the class characteristic similarity larger than the preset similarity to obtain primary classifying characteristics.
It should be noted that, the size of the preset similarity can be set by those skilled in the art according to actual needs, and the present invention is not limited herein.
It should be noted that the system first determines the time required to classify the input data: the total classification duration is computed from the number of paragraphs of the input data, the duration required to classify the α-th paragraph, and the signal-to-noise ratio of the receiving end for that paragraph. This calculation helps to understand the overall classification time, and the processing time of each part can be determined according to the properties and signal-to-noise ratio of the different paragraphs. The system then calculates the similarity between different category features, which can be understood as the degree of association between features; the category feature similarity must comprehensively consider factors such as the classification space smoothing coefficient and the classification space friction coefficient in order to establish a similarity measure between features. Finally, the category features meeting both conditions are fused: the classification duration is smaller than the preset duration and the category feature similarity is larger than the preset similarity. In other words, the system selects category features that exhibit high similarity within a short time and fuses them, which helps obtain meaningful results within a limited time. By calculating the classification duration, the system can predict the time required by the whole process, which supports a reasonable allocation of time resources; selecting features under these constraints also reduces processing complexity and improves efficiency. The category feature similarity calculation helps select features that are semantically closer, which strengthens classification accuracy, and fusing only category features with high similarity avoids confusion and interference. Completing classification within the preset duration facilitates resource management: by limiting the classification time, the system produces a useful result within an effective time and avoids unbounded computation, thereby optimizing resource use.
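A minimal sketch of this first-constraint check is given below. The duration follows the published formula L = M · Σ T_α · R_α; the category feature similarity expression is not reproduced in the text, so the smoothed, friction-damped cosine used here is an assumed stand-in that merely reuses the named coefficients a and f.

```python
import numpy as np

def classification_duration(num_paragraphs: int, t: np.ndarray, r: np.ndarray) -> float:
    """L = M * sum_alpha(T_alpha * R_alpha)."""
    return num_paragraphs * float(np.sum(t * r))

def category_feature_similarity(x1: np.ndarray, x2: np.ndarray,
                                a: float = 1e-3, f: float = 0.1) -> float:
    # assumed form: cosine similarity with smoothing a and friction f
    cos = float(np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2) + a))
    return cos * (1.0 - f)  # friction shrinks the raw similarity

t = np.array([0.8, 1.2, 0.5])      # seconds per paragraph
r = np.array([20.0, 18.5, 22.1])   # receiver SNR per paragraph
L = classification_duration(num_paragraphs=3, t=t, r=r)

x1, x2 = np.array([1.0, 0.2, 0.0]), np.array([0.9, 0.3, 0.1])
s = category_feature_similarity(x1, x2)

if L < 60.0 and s > 0.8:           # preset duration / similarity thresholds
    print("features qualify for fusion into the primary classification features")
```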
S104: and determining a classification parameter based on the feature vector corresponding to the primary classification feature, and adjusting the feature vector based on the classification parameter.
The feature vector is a binary value corresponding to the primary classification features.
In one possible implementation, S104 specifically includes:
s1041: the maximum posterior assumption is made for the primary classification feature:
wherein N represents the hypothesis obtained by maximum a posteriori estimation, p(c) represents the prior probability of category c, p(x) represents the probability of the observed data x, n represents the number of primary classification features, p(x|c) represents the probability of the observed data x occurring under category c, and min L represents the loss function expressed by the minimum classification duration;
s1042: calculating classification parameters by combining the maximum posterior assumption:
wherein D represents the objective function of the feature vector a, n represents the number of primary classification features, η represents the classification parameter, v represents an auxiliary vector of the feature vector (that is, an auxiliary parameter of the feature vector), and p(v|neighbor(v)) represents the occurrence probability of the auxiliary vector given the neighbor objects of the auxiliary vector.
It should be noted that the primary classification features are reasoned about according to a Bayesian statistical approach: the system considers the prior probability of each category and the probability of the observed data occurring given the primary classification features, and this information allows the category probabilities and the distribution of the data to be weighed jointly during inference. The system then combines the maximum a posteriori hypothesis and calculates the classification parameter according to an objective function D; this objective function may be a mathematical function of the classification parameter whose goal is to make the classified feature vector match the observed data as closely as possible, which helps adjust the classification parameter so as to better capture the characteristics of the data. By inferring the classification features from the maximum a posteriori hypothesis, the system can characterize the relationships between categories and features more accurately, which helps generate more informative feature vectors. Adjusting the classification parameter helps match the primary classification features to the attributes of the specific data set, meaning that the system can adapt to different characteristics of the data and thereby improve processing efficiency and accuracy. By considering the maximum a posteriori hypothesis together with the objective function, the system combines prior knowledge and observed data, making comprehensive use of the available information during classification and improving overall classification performance.
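The following sketch illustrates the maximum a posteriori step with toy priors and likelihoods. The concrete objective function D and the way η is derived are not reproduced in the text, so the choice of η below (the prior of the MAP class) is an assumption made only for illustration.

```python
import numpy as np

priors = {"report": 0.5, "contract": 0.3, "memo": 0.2}   # p(c)
likelihood = {                                             # p(x_i = 1 | c)
    "report":   np.array([0.9, 0.7, 0.2]),
    "contract": np.array([0.3, 0.8, 0.6]),
    "memo":     np.array([0.4, 0.2, 0.9]),
}

x = np.array([1, 1, 0])   # binary primary-classification feature vector

def map_class(x: np.ndarray) -> str:
    """argmax_c p(c) * prod_i p(x_i | c); p(x) cancels in the argmax."""
    def posterior(c):
        p = likelihood[c]
        return priors[c] * np.prod(np.where(x == 1, p, 1.0 - p))
    return max(priors, key=posterior)

c_hat = map_class(x)
eta = priors[c_hat]        # assumed: prior of the MAP class used as eta
adjusted = eta * x         # feature vector adjusted by the classification parameter
print(c_hat, adjusted)
```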
S105: and (3) adopting a two-point intersection algorithm, adjusting the classifying parameters, secondarily classifying the classifying features with the minimum classifying resource consumption as a second constraint condition to obtain secondary classifying features, classifying the input data according to the secondary classifying features, and obtaining multi-class sub-input data.
In one possible implementation, secondarily classifying the classification features with minimum classification resource consumption as a second constraint condition to obtain secondary classification features, and classifying the input data according to the secondary classification features to obtain multiple classes of sub-input data, specifically includes:
s1051: and carrying out secondary classification on the classification characteristics by taking minimum classification resource consumption as a second constraint condition:
wherein e represents the classification resource consumption, x'_γ represents the γ-th class of secondary classification features, α_γ represents the calculation rate for obtaining the γ-th class of secondary classification features, Y represents the periodic resource consumption in the classification process, W represents the probability weight, P_γ represents the prior probability of the γ-th class of secondary classification features, and R_γ represents the signal-to-noise ratio of the receiving end for obtaining the γ-th class of secondary classification features;
s1052: and classifying the input data according to the secondary classification characteristics to obtain multi-class sub-input data.
It should be noted that the system uses a two-point crossover algorithm, an optimization algorithm that searches for a better solution iteratively. The goal is to minimize the classification resource consumption, which may include the calculation rate, the periodic resource consumption and so on; by adjusting the classification features, the system tries to find a set of parameters that yields a better classification result without increasing resource consumption, avoiding the low class distinguishability that can occur in the primary classification. After the optimized secondary classification features are obtained, the system uses them to reclassify the input data, dividing it into multiple classes or subsets, each of which represents similar features or attributes. By performing the secondary classification with this optimization algorithm, the system obtains a higher-quality classification result without additional resource consumption, improving classification accuracy and practicality. Taking minimum resource consumption as the constraint condition means the system considers the effective use of resources during classification and optimization, which helps avoid waste and ensures that data are processed efficiently while resources are saved. Through these repeated classifications, the system further subdivides the input data into multiple subsets, which supports a deeper understanding of the characteristics and attributes of the data for more detailed analysis and processing.
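The sketch below shows the mechanics of a two-point crossover search that minimizes a resource-consumption objective; the actual consumption function e of the method is not reproduced here, so the quadratic placeholder cost is an assumption and only the crossover and selection loop reflects the description.

```python
import random

def resource_cost(params):
    # placeholder objective: assumed quadratic cost around a target setting
    target = [0.3, 0.7, 0.5, 0.9]
    return sum((p - t) ** 2 for p, t in zip(params, target))

def two_point_crossover(a, b):
    # pick two cut points and swap the middle segment between parents
    i, j = sorted(random.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

random.seed(0)
population = [[random.random() for _ in range(4)] for _ in range(20)]
for _ in range(50):
    population.sort(key=resource_cost)
    parents = population[:10]                 # keep the lowest-cost parameter sets
    children = []
    while len(children) < 10:
        p1, p2 = random.sample(parents, 2)
        c1, c2 = two_point_crossover(p1, p2)
        children.extend([c1, c2])
    population = parents + children[:10]

best = min(population, key=resource_cost)
print("classification parameters with lowest resource cost:", best)
```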
In one possible implementation, after S105, the method further includes:
S105A: and storing the secondary classification features by adopting a Mat data structure according to the storage execution rule from top to bottom.
It should be noted that the Mat data structure is used to store the classification features of the input data, and the specific storage rule is executed from top to bottom. Pointer operations are performed on the classification results through a block of data with contiguous addresses, which fundamentally avoids the complicated addressing process of traditional storage of classification data: the classification results are accessed directly, reducing the addressing time otherwise spent locating each node in memory. This further improves efficiency during classification and form generation, avoids the frequent stalls caused by a mismatch between computation speed and storage speed, and optimizes the form generation process.
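As an analogue of this storage scheme, the sketch below keeps the secondary classification features in a single contiguous block so that rows can be read back by direct indexing rather than node-by-node addressing; the use of a NumPy array in place of the Mat structure is an assumption made for illustration.

```python
import numpy as np

secondary_features = [
    [0.12, 0.88, 0.40],   # class 1 features, stored first (top)
    [0.75, 0.10, 0.33],   # class 2
    [0.05, 0.64, 0.91],   # class 3, stored last (bottom)
]

# store top-to-bottom in one contiguous block of addresses
mat = np.ascontiguousarray(secondary_features, dtype=np.float32)
assert mat.flags["C_CONTIGUOUS"]

row = 1
print(mat[row])           # direct indexed access, no per-node traversal
```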
S106: and carrying out semantic recognition on various sub-input data by using the large language model, and calculating semantic similarity after semantic recognition.
In one possible implementation, the large language model includes: a BERT model, a GPT model, an XLNet model, a RoBERTa model, or a T5 model, and S106 is specifically:
s1061: and calculating the semantic similarity after semantic recognition through a cosine similarity calculation formula.
It should be noted that the system uses an advanced large language model, such as BERT, GPT, XLNet, RoBERTa or T5, to perform semantic recognition on the different classes of sub-input data, understanding the meaning and context of each class to obtain deeper semantic information. To measure the semantic similarity between sub-input data, the system uses the cosine similarity formula, which quantifies the similarity of two vectors by comparing the angle between them; each piece of sub-input data can be represented as a vector, and the resulting similarity helps describe the degree of semantic association between different sub-input data. Using a large language model for semantic recognition captures the semantics of text data more accurately than traditional rule-based or keyword-based methods. A large language model can take the contextual information of the text into account to better understand its meaning, which helps identify texts that have similar meaning but different wording and yields a more accurate semantic similarity. Different large language models can be used to suit different text types and tasks, and this flexibility allows the system to perform semantic recognition and similarity calculation effectively across domains. Once accurate semantic similarity information is available, subsequent processing and analysis can be carried out in a more targeted way, further improving system performance.
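A minimal sketch of this step is given below; a sentence-transformers encoder (model name assumed) stands in for the large language model, and the cosine similarity is computed exactly as stated.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

sub_inputs = [
    "Quarterly revenue grew by 12 percent compared with last year.",
    "The report will be presented to the finance committee on Friday.",
]
emb = model.encode(sub_inputs)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

similarity = cosine(emb[0], emb[1])
print(f"semantic similarity: {similarity:.3f}")
```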
S107: if the semantic similarity is smaller than the preset similarity, the process goes to S108, otherwise, the process goes back to S105.
It should be noted that, by setting a preset similarity threshold, the system only proceeds to summarization when the semantic similarity between the classes of sub-input data is sufficiently low, that is, when the classes are well separated; otherwise it reclassifies rather than continuing, which saves resources and time and improves processing efficiency. Continuing only when the classification result is sufficiently distinguishable helps ensure that the system processes meaningful data and optimizes the final result. The preset similarity threshold can be adjusted according to the requirements of the specific application, so that the system can flexibly decide whether to continue under different conditions. The value of the preset similarity threshold can be set by a person skilled in the art according to actual needs, and the invention is not limited herein.
S108: semantic summarization words of various types of sub-input data are generated by using a large language model.
In one possible implementation, S108 specifically includes:
s1081: calculating the occurrence probability value of the high-frequency words with occurrence times larger than preset times in each piece of sub-input data:
wherein r_i represents the number of relevant sentences containing the high-frequency word i, n_i represents the number of sentences containing the high-frequency word i, M represents the number of all sentences in the sub-input data, R represents the number of preset summarization words related to the high-frequency word, f_i represents the frequency of occurrence of the high-frequency word i in the sub-input data, qf_i represents the frequency of occurrence of the high-frequency word i in the preset summarization words, and k_1, k_2 and K represent empirically set parameters;
s1082: and taking the preset summarization word with the largest occurrence probability as a semantic summarization word.
It should be noted that the system selects, from the high-frequency words scored by the above calculation, the preset summarization word with the largest occurrence probability value as the semantic summarization word of each class of sub-input data. These summarization words represent important keywords in the text and can express the main semantic content of the sub-input data. By calculating the probability values of the high-frequency words and selecting the preset summarization word with the largest occurrence probability, the system can extract the main semantic content of the sub-input data more accurately and thereby summarize the data better. The semantic summarization word is a brief summary of the sub-input data: it presents the main information of the data more conveniently, reduces redundancy and noise, and helps people understand the main content of the sub-input data more quickly, improving the readability and comprehensibility of the data.
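The variable list above (r_i, n_i, M, R, f_i, qf_i, k_1, k_2, K) matches the classic Robertson-Sparck Jones relevance weight combined with Okapi-style term-frequency saturation, so the sketch below uses that standard form as an assumed reconstruction of the scoring step; it is illustrative only.

```python
import math

def summarization_score(r_i, n_i, M, R, f_i, qf_i, k1=1.2, k2=100.0, K=1.5):
    # relevance weight of the high-frequency word with respect to the preset summaries
    w = math.log(((r_i + 0.5) * (M - n_i - R + r_i + 0.5)) /
                 ((n_i - r_i + 0.5) * (R - r_i + 0.5)))
    # saturation of in-document and in-summary term frequencies
    return w * ((k1 + 1) * f_i / (K + f_i)) * ((k2 + 1) * qf_i / (k2 + qf_i))

# toy numbers: one high-frequency word scored against the preset summarization words
print(summarization_score(r_i=3, n_i=10, M=40, R=5, f_i=7, qf_i=2))
```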
S109: and combining sub-input data corresponding to the semantic summarization words to generate a display form.
In one possible implementation, S109 specifically includes:
s1091: defining a form structure of the display form according to the semantic summarization word and the quantity of sub-input data corresponding to the semantic summarization word;
s1092: and filling sub-input data corresponding to the semantic summarization words into the display form.
It should be noted that, in the form generation process, the system defines and generates the structure of the display form according to the semantic summarization words and the number of pieces of sub-input data corresponding to them. The form structure may include elements such as a title, subtitles, data fields and charts, and the specific structure and arrangement are designed according to the application scenario and requirements. The system then populates the semantic summarization words and their corresponding sub-input data into the previously defined display form structure; a specific filling method may place the semantic summarization words in the title or subtitle positions of the form and fill the details of the sub-input data into the data fields. Generating a display form integrates complex semantic summaries and data information into an easily understood form and provides a comprehensive view; if visual elements such as charts are used, the data are presented even more intuitively, strengthening the user's understanding. Users can quickly obtain the main information through the form, saving the time spent studying and analyzing the data, and the display form helps users extract insights and trends from the data, supporting decision making and further research.
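For illustration, the sketch below builds a simple HTML table in which each semantic summarization word becomes a row header and its sub-input data fill the row; the concrete layout is an assumption, since the method leaves the form structure open to the application scenario.

```python
from html import escape

def build_form(sections: dict[str, list[str]], title: str = "Generated form") -> str:
    """Each summarization word heads a row; its sub-input data fill the cells."""
    rows = []
    for word, items in sections.items():
        cells = "".join(f"<td>{escape(item)}</td>" for item in items)
        rows.append(f"<tr><th>{escape(word)}</th>{cells}</tr>")
    return (f"<table><caption>{escape(title)}</caption>"
            + "".join(rows) + "</table>")

sections = {
    "Revenue": ["Q1 revenue grew 12%", "Q2 forecast raised"],
    "Schedule": ["Presented to committee on Friday"],
}
print(build_form(sections))
```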
In one possible implementation, after S109, the method further includes:
s110: and sending the display form to the front end for rendering, and displaying the rendered display form.
It should be noted that, after the front end receives the display form data from the system, it renders the display form according to the structure and content in the data. Rendering refers to converting the data into a graphical or textual presentation visible to the user, so that the user can intuitively see the content of the display form on the interface; this may include setting an appropriate layout, style, font and color to ensure the form is presented in a way that is easy to read and understand. The user can then interact with the display form through operations such as scrolling and clicking in order to explore the data and semantic information in depth.
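A minimal sketch of handing the generated form to a front end is shown below; the use of Flask and a JSON payload is an assumption for illustration, as the method does not prescribe a particular delivery mechanism.

```python
from flask import Flask, jsonify

app = Flask(__name__)

FORM = {
    "title": "Generated form",
    "sections": {"Revenue": ["Q1 revenue grew 12%"], "Schedule": ["Friday"]},
}

@app.get("/form")
def get_form():
    # the front end fetches this JSON and renders it as a table or other layout
    return jsonify(FORM)

if __name__ == "__main__":
    app.run(port=8000)
```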
Compared with the prior art, the invention has at least the following beneficial technical effects:
In the invention, the domain ontology of the input data is determined in advance through the domain knowledge graph, which narrows the calculation range of the large language model and improves the accuracy of form generation while reducing the form generation time. In addition, throughout the form generation process the input data are classified multiple times by combining the classification duration and the category feature similarity, which improves the classification accuracy of the input data while ensuring classification efficiency and feasibility, and thus further improves the accuracy of form generation. After the multiple classifications, the semantic analysis capability of existing large language models is fully utilized: semantic analysis is performed on the classified sub-input data and the semantic similarity after analysis is calculated; the semantic summarization words of the form are generated only when the semantic similarity falls below a certain level, and otherwise the classification parameter is adjusted and the data are reclassified, which improves the distinguishability of the classification result and therefore of the finally generated form. Automatic form generation can greatly improve the efficiency and quality of text work, making text processing more efficient and accurate and reducing the burden of manual processing.
Example 2
In one embodiment, the present invention provides a form generating system based on a large language model, for executing the form generating method based on the large language model in embodiment 1.
The form generation system based on a large language model provided by the invention can realize the steps and effects of the form generation method based on a large language model in embodiment 1; to avoid repetition, details are not repeated here.
Compared with the prior art, the invention has at least the following beneficial technical effects:
In the invention, the domain ontology of the input data is determined in advance through the domain knowledge graph, which narrows the calculation range of the large language model and improves the accuracy of form generation while reducing the form generation time. In addition, throughout the form generation process the input data are classified multiple times by combining the classification duration and the category feature similarity, which improves the classification accuracy of the input data while ensuring classification efficiency and feasibility, and thus further improves the accuracy of form generation. After the multiple classifications, the semantic analysis capability of existing large language models is fully utilized: semantic analysis is performed on the classified sub-input data and the semantic similarity after analysis is calculated; the semantic summarization words of the form are generated only when the semantic similarity falls below a certain level, and otherwise the classification parameter is adjusted and the data are reclassified, which improves the distinguishability of the classification result and therefore of the finally generated form. Automatic form generation can greatly improve the efficiency and quality of text work, making text processing more efficient and accurate and reducing the burden of manual processing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of technical features, it should be considered to fall within the scope of this description.
The foregoing examples illustrate only a few embodiments of the invention; they are described in detail but are not to be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the invention, and all of these fall within the scope of the invention. Accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (7)

1. A form generation method based on a large language model, comprising:
s101: receiving input data of a user;
s102: determining a domain ontology of the input data based on a domain knowledge graph, wherein the domain knowledge graph comprises WordNet, DBpedia or YAGO;
s103: within the domain ontology range of the input data, calculating primary classification features of the input data, with the classification duration being smaller than a preset duration and the category feature similarity being larger than a preset similarity as a first constraint condition;
s104: determining a classification parameter based on a feature vector corresponding to the primary classification features, and adjusting the feature vector based on the classification parameter, wherein the feature vector is a binary value corresponding to the primary classification features;
s105: adjusting the classification parameter by adopting a two-point crossover algorithm, secondarily classifying the classification features with minimum classification resource consumption as a second constraint condition to obtain secondary classification features, and classifying the input data according to the secondary classification features to obtain multiple classes of sub-input data;
s106: carrying out semantic recognition on various sub-input data by utilizing the large language model, and calculating semantic similarity after semantic recognition;
s107: if the semantic similarity is smaller than the preset similarity, entering S108, otherwise, returning to S105;
s108: generating semantic summarization words of various sub-input data by using the large language model;
s109: combining the semantic summarization words and sub-input data corresponding to the semantic summarization words to generate a display form;
wherein, the step S103 specifically includes:
s1031: calculating the classification duration:
L = M · Σ_α (T_α · R_α)
wherein L represents the classification duration, M represents the number of paragraphs of the input data, T_α represents the duration required to classify the α-th paragraph, and R_α represents the signal-to-noise ratio of the receiving end for the α-th paragraph;
s1032: calculating the category feature similarity:
wherein s represents the category feature similarity, x_β represents the β-th category feature, a represents the classification space smoothing coefficient, and f represents the classification space friction coefficient;
s1033: fusing the classification characteristics with the classification duration smaller than the preset duration and the classification characteristic similarity larger than the preset similarity to obtain the primary classification characteristics;
the step S104 specifically includes:
s1041: and carrying out maximum posterior assumption on the primary classification characteristics:
wherein N represents the hypothesis obtained by maximum a posteriori estimation, p(c) represents the prior probability of category c, p(x) represents the probability of the observed data x, n represents the number of the primary classification features, p(x|c) represents the probability of the observed data x occurring under category c, and min L represents the loss function expressed by the minimum classification duration;
s1042: calculating the categorization parameters in combination with the maximum posterior assumption:
wherein D represents the objective function of the feature vector a, n represents the number of the primary classification features, η represents the classification parameter, v represents an auxiliary vector of the feature vector (that is, an auxiliary parameter of the feature vector), and p(v|neighbor(v)) represents the occurrence probability of the auxiliary vector given the neighbor objects of the auxiliary vector;
wherein secondarily classifying the classification features with minimum classification resource consumption as a second constraint condition to obtain secondary classification features, and classifying the input data according to the secondary classification features to obtain the multiple classes of sub-input data, specifically comprises:
s1051: and carrying out secondary classification on the classification characteristics by taking minimum classification resource consumption as a second constraint condition:
wherein e represents the classification resource consumption, x'_γ represents the γ-th class of secondary classification features, α_γ represents the calculation rate for obtaining the γ-th class of secondary classification features, Y represents the periodic resource consumption in the classification process, W represents the probability weight, P_γ represents the prior probability of the γ-th class of secondary classification features, and R_γ represents the signal-to-noise ratio of the receiving end for obtaining the γ-th class of secondary classification features;
s1052: and classifying the input data according to the secondary classification characteristics to obtain multi-class sub-input data.
2. The large language model based form generation method according to claim 1, further comprising, after S105:
S105A: and storing the secondary classification features by adopting a Mat data structure according to a storage execution rule from top to bottom.
3. The large language model based form generation method of claim 1, wherein the large language model comprises: the BERT model, the GPT model, the XLNet model, the RoBERTa model, or the T5 model, the S106 is specifically:
s1061: and calculating the semantic similarity after semantic recognition through a cosine similarity calculation formula.
4. The large language model based form generation method according to claim 1, wherein S108 specifically comprises:
s1081: calculating the occurrence probability value of the high-frequency words with occurrence times larger than preset times in each piece of sub-input data:
wherein r_i represents the number of relevant sentences containing the high-frequency word i, n_i represents the number of sentences containing the high-frequency word i, M represents the number of all sentences in the sub-input data, R represents the number of preset summarization words related to the high-frequency word, f_i represents the frequency of occurrence of the high-frequency word i in the sub-input data, qf_i represents the frequency of occurrence of the high-frequency word i in the preset summarization words, k_1, k_2 and K represent empirically set parameters, and Q represents the set of high-frequency words whose number of occurrences in the sub-input data is greater than the preset number;
s1082: and taking the preset summarization word with the largest occurrence probability as the semantic summarization word.
5. The large language model based form generation method according to claim 1, wherein S109 specifically comprises:
s1091: defining a form structure of the display form according to the semantic summarization word and the quantity of sub-input data corresponding to the semantic summarization word;
s1092: and filling the sub-input data corresponding to the semantic summarization word into the display form.
6. The large language model based form generation method according to claim 1, further comprising, after S109:
s110: and sending the display form to the front end for rendering, and displaying the rendered display form.
7. A large language model based form generation system for performing the large language model based form generation method of any one of claims 1 to 6.
CN202311173289.2A 2023-09-12 2023-09-12 Form generation method and system based on large language model Active CN116910275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311173289.2A CN116910275B (en) 2023-09-12 2023-09-12 Form generation method and system based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311173289.2A CN116910275B (en) 2023-09-12 2023-09-12 Form generation method and system based on large language model

Publications (2)

Publication Number Publication Date
CN116910275A CN116910275A (en) 2023-10-20
CN116910275B (en) 2023-12-15

Family

ID=88355026

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311173289.2A Active CN116910275B (en) 2023-09-12 2023-09-12 Form generation method and system based on large language model

Country Status (1)

Country Link
CN (1) CN116910275B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118069852B (en) * 2024-04-22 2024-07-12 数据空间研究院 Multi-model fusion data classification prediction method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084338A (en) * 2020-09-18 2020-12-15 达而观数据(成都)有限公司 Automatic document classification method, system, computer equipment and storage medium
CN112215007A (en) * 2020-10-22 2021-01-12 上海交通大学 Organization named entity normalization method and system based on LEAM model
CN112699240A (en) * 2020-12-31 2021-04-23 荆门汇易佳信息科技有限公司 Intelligent dynamic mining and classifying method for Chinese emotional characteristic words
CN115221864A (en) * 2022-07-28 2022-10-21 南京航空航天大学 Multi-mode false news detection method and system

Also Published As

Publication number Publication date
CN116910275A (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110888990B (en) Text recommendation method, device, equipment and medium
Choo et al. Utopian: User-driven topic modeling based on interactive nonnegative matrix factorization
Bian et al. Multimedia summarization for social events in microblog stream
Pietsch et al. Topic modeling for analyzing open-ended survey responses
JP5332477B2 (en) Automatic generation of term hierarchy
CN116910275B (en) Form generation method and system based on large language model
US20160098405A1 (en) Document Curation System
US20090182723A1 (en) Ranking search results using author extraction
AU2015204283A1 (en) Text mining system and tool
JP5391632B2 (en) Determining word and document depth
Sarvabhotla et al. Sentiment classification: a lexical similarity based approach for extracting subjectivity in documents
JP2009093653A (en) Refining search space responding to user input
US20140379719A1 (en) System and method for tagging and searching documents
TW201915777A (en) Financial analysis system and method for unstructured text data
KR20150032164A (en) Active Knowledge Guidance Based on Deep Document Analysis
CN106126605B (en) Short text classification method based on user portrait
JP5218409B2 (en) Related information search system and related information search method
CN109933702B (en) Retrieval display method, device, equipment and storage medium
WO2024139925A1 (en) Method and system for constructing visualization graph based on natural language
Ehrhardt et al. Omission of information: Identifying political slant via an analysis of co-occurring entities
CN112528640A (en) Automatic domain term extraction method based on abnormal subgraph detection
US9785404B2 (en) Method and system for analyzing data in artifacts and creating a modifiable data network
JP2020064463A (en) Information operating device and information operating method
Rybak et al. Machine learning-enhanced text mining as a support tool for research on climate change: theoretical and technical considerations
US20230222281A1 (en) Modifying the presentation of drawing objects based on associated content objects in an electronic document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant