CN117370527A - Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT - Google Patents

Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT

Info

Publication number
CN117370527A
CN117370527A
Authority
CN
China
Prior art keywords
data
module
chatgpt
article
industry standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311380162.8A
Other languages
Chinese (zh)
Inventor
余莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunji Smart Engineering Co ltd
Original Assignee
Yunji Smart Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunji Smart Engineering Co ltd filed Critical Yunji Smart Engineering Co ltd
Priority to CN202311380162.8A priority Critical patent/CN117370527A/en
Publication of CN117370527A publication Critical patent/CN117370527A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/322Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3347Query execution using vector based model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for constructing a traffic and construction industry standard knowledge base by using ChatGPT. The method comprises the following steps: acquiring industry standard data; processing the industry standard data into article modules; processing the article modules into vector data with ChatGPT; and storing the vector data. The invention constructs a question-answering knowledge base system based on ChatGPT; during construction, the knowledge base does not need to be combed through manually, and ChatGPT can summarize the content of each article and form corresponding question points.

Description

Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT
Technical Field
The invention relates to the technical field of knowledge bases, and in particular to a method and a system for constructing a traffic and construction industry standard knowledge base by using ChatGPT.
Background
With the development of digital twin platforms, various standards for the traffic and construction industries have gradually taken shape, including national, industry, group and enterprise standards for operations, construction and the like, as well as various standard methods for implementation and construction. At present, however, this knowledge is scattered and cannot readily be brought into daily work. For example, when writing a scheme, a large amount of reference material must be queried and compared before the scheme can be formed. In daily construction work, writing a construction standard requires knowledge of construction methods, regulations and other information. In model design, for example, designers must know the component construction size standards for the project characteristics of each industry and the coding requirements including element information, and must browse through the corresponding scene information before model design can begin. Designers therefore spend a great deal of time querying and collecting data, and can only carry out model design work after analysis and review. The traditional solution is to arrange for professionals to classify the knowledge and build a content-labelling database model in which content is returned by matching labels; this approach cannot satisfy users' personalized questions and additional business requirements, nor their more complex business scenarios.
Disclosure of Invention
Therefore, in order to overcome the defects of the prior art, the invention provides a method and a system for constructing a traffic and construction industry standard knowledge base by using ChatGPT, which reduce labor cost and are convenient to use.
The technical scheme of the invention is a method for constructing a traffic and construction industry standard knowledge base by using ChatGPT, which comprises the following steps:
acquiring industry standard data;
processing the industry standard data into an article module;
the ChatGPT processes the article module into vector data;
the vector data is stored.
Further, the step of processing the industry standard data into an article module comprises the following steps:
preprocessing the industry standard data into text data;
and splitting the text data into the article modules.
Further, the step of preprocessing the industry standard data into text data comprises: describing the picture information.
Further, the step of splitting the text data into the article modules comprises: performing encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis on the text data.
Further, the step of the ChatGPT processing the article module into vector data comprises:
the ChatGPT fine-tuning the article module according to the model;
and the ChatGPT processing the fine-tuned article module into the vector data.
The invention provides another technical scheme: a system for constructing a traffic and construction industry standard knowledge base by using ChatGPT, comprising:
the acquisition module is used for acquiring industry standard data;
the processing module is used for processing the industry standard data into an article module;
the ChatGPT is used for processing the article module into vector data;
and the vector database is used for storing the vector data.
Further, the processing module comprises a preprocessing module and a splitting module;
the preprocessing module is used for preprocessing the industry standard data into text data;
and the splitting module is used for splitting the text data into the article modules.
Further, the preprocessing module is configured to describe the picture information.
Further, the splitting module is configured to perform encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis on the text data.
Further, the ChatGPT is configured to:
fine-tune the article module according to the model;
and process the fine-tuned article module into the vector data.
The invention constructs a question-answering knowledge base system based on ChatGPT; during construction, the knowledge base does not need to be combed through manually, and ChatGPT can summarize the content of each article and form corresponding question points.
Drawings
Fig. 1 is a flow chart of the method for constructing a traffic and construction industry standard knowledge base using ChatGPT according to the present invention.
Fig. 2 is a schematic block diagram of the system for constructing a traffic and construction industry standard knowledge base using ChatGPT according to the present invention.
Detailed Description
For a thorough understanding of the objects, features and effects of the present invention, reference will be made to the following detailed description of the invention taken in conjunction with the accompanying drawings.
1. The invention provides a method for constructing a traffic and construction industry standard knowledge base by using ChatGPT, as shown in FIG. 1, which specifically comprises the following steps.
100. Acquiring industry standard data;
200. processing industry standard data into an article module;
300. The ChatGPT processes the article module into vector data;
400. vector data is stored.
In the present invention, in step 100, industry standard data is acquired. The industry standard data includes: national standards, industry standards, group standards and enterprise standards issued by the government, professional books, bidding requirements, various bidding documents, and regulatory documents covering the various processes of traffic construction engineering, construction safety and the like.
In the present invention, in step 200, the industry standard data is processed into article modules, comprising the following steps:
201. The industry standard data is preprocessed into text data.
Preprocessing refers to uniformly converting the data into a plain-text format while preserving the paragraph format; the industry standard data comprises text content, pictures and tables.
When processing a picture, the picture information needs to be described. This includes describing the specific content in the picture, the coloring of each of its modules and the scene range it applies to, creating a hash index for the picture, and storing the index with the picture description. Alternatively, the context of the picture can be analyzed and summarized through LlamaIndex, then divided into a content module, the module content tagged as 'picture content data', processed according to the flow described here, and stored in the vector database.
When processing a table, the table header and table content are converted into a txt document in CSV format. For standards containing many tables, such as the code dictionaries of various industries specified in national standards, a user who wants to obtain the corresponding code set through the industry standard can have codes generated automatically from the processed code knowledge base. For such data, we preferentially convert it into article sections consisting of table header and table content, name the stored files by table name and sequence number, add the same data content label (such as the pier coding standard of the national Industrial Internet channel coding standard 1.0) to each table file, and then build indexes for them separately in LlamaIndex, thereby obtaining the complete table data relationships. When a user needs this data for a relevant scenario, such as generating an Industrial Internet identification code, the system obtains a file list from the vector database according to the user's question, sends the data content to ChatGPT in batches, informs ChatGPT that the codes to be generated must follow the data in the table, and through several rounds of adjustment obtains the code information the user wants.
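As a rough illustration of the table step above, a parsed table can be rendered as CSV-format text, named by table name and sequence number, with the shared data-content label prepended. The function, file-naming pattern and label below are hypothetical — a minimal sketch, not the patented pipeline:

```python
import csv
import io

def table_to_csv_txt(table_name, seq_no, header, rows, label):
    """Render one table as CSV text prefixed with the shared data-content
    label, and return (file_name, content). File names follow the
    'table name + sequence number' convention described above."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    file_name = f"{table_name}_{seq_no}.txt"
    content = f"# {label}\n{buf.getvalue()}"
    return file_name, content

# Hypothetical pier-coding table, for illustration only
name, text = table_to_csv_txt(
    "pier_codes", 1,
    ["code", "component"],
    [["01", "pier shaft"], ["02", "pier cap"]],
    "pier coding standard",
)
```

Each such file would then be indexed separately (e.g. in LlamaIndex) so the table relationships can be queried later.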
202. The text data is split into article modules.
The text data is split uniformly by paragraph; this comprises encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis of the text data. Because of the ChatGPT maximum-token requirement (no more than 8191 tokens per block, the input length constraint of the OpenAI embeddings model), the document content must be split, clipping it into multiple article modules (text chunks).
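The paragraph-based splitting under a token budget can be sketched as follows. This is a simplified illustration: token counts are approximated by character length (a real pipeline would use a tokenizer such as tiktoken), and the characters-per-token ratio is an assumption:

```python
def split_into_modules(text, max_tokens=8191, chars_per_token=2):
    """Split text by paragraph into article modules (text chunks) that stay
    under the embeddings input limit. Tokens are approximated here by
    character count; swap in a real tokenizer for production use."""
    budget = max_tokens * chars_per_token
    modules, current = [], ""
    for para in text.split("\n\n"):
        candidate = (current + "\n\n" + para).strip() if current else para
        if len(candidate) <= budget:
            # paragraph still fits in the current module
            current = candidate
        else:
            # flush the current module and start a new one
            if current:
                modules.append(current)
            current = para
    if current:
        modules.append(current)
    return modules
```

Splitting on paragraph boundaries keeps each module a complete semantic segment, as the text above requires.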
Redundant-character processing: repeated characters, and data in the document that does not belong to business rules, such as factual descriptions and cases, can be given lower storage priority or deleted outright.
Lexical analysis: article slices need to be richly labelled along the time, space, logic and catalogue dimensions. Time dimension: data carrying state meaning, time or temporal meaning must be extracted separately as reference tags for the content context of the segment; for example, current rules, expired rules and planned rules represent present limits, past validity and future validity respectively. Space dimension: data containing position information, area information, national administrative region information and the like needs a space-dimension tag. Logic dimension: content rules containing keywords such as 'must' and 'may' need a logic-dimension tag.
Catalogue dimension: a catalogue label is added to the catalogue result of the article module. The document's paragraph catalogue results must be saved with the directory structure from the root directory to the current node.
Directory tree analysis: check whether the current directory is a final leaf directory; if so, add an end flag; if not, add downstream node information such as next=current execution standard coding rule.
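A minimal sketch of the directory tree analysis just described, assuming a hypothetical representation of the tree as nested dicts: leaf directories get an end flag, interior nodes get a next= pointer to downstream node information:

```python
def annotate_directory_tree(node):
    """Walk a directory tree (dicts with 'name' and 'children'). Leaf
    directories receive an end flag; non-leaf directories receive a
    'next' pointer naming the downstream node, as in the analysis above."""
    if not node.get("children"):
        node["end"] = True
    else:
        node["next"] = node["children"][0]["name"]
        for child in node["children"]:
            annotate_directory_tree(child)
    return node

# Hypothetical two-level catalogue for illustration
tree = annotate_directory_tree({
    "name": "standard",
    "children": [{"name": "coding rules", "children": []}],
})
```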
In the present invention, in step 300, ChatGPT processes the article modules into vector data, which comprises the following steps.
301. ChatGPT fine-tunes the article modules according to the model.
The article modules are learned and summarized, and converted into a number of vectors that carry summaries. During fine-tuning, content is de-duplicated and merged, and the data is read and summarized to form corresponding question sets. For OpenAI, determining the similarity of two pieces of text requires first turning each into a numeric vector (a vector embedding), something like a stack of coordinate values; comparing the vectors then yields a decimal between 0 and 1, and the closer it is to 1, the higher the similarity. The sent content and the returned results are merged according to the similarity result, and duplicates are removed.
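The similarity comparison and de-duplication described here can be sketched as below; the 0.95 near-duplicate threshold is an assumed value for illustration, not one given by the source:

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors; for typical text embeddings
    the result falls between 0 and 1, and closer to 1 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def dedupe(vectors, threshold=0.95):
    """Keep only vectors that are not near-duplicates of one already kept."""
    kept = []
    for v in vectors:
        if all(cosine_similarity(v, k) < threshold for k in kept):
            kept.append(v)
    return kept
```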
302. ChatGPT's Embedding processes the fine-tuned article modules into vector data.
A calling program is written in Python to call the OpenAI API (application programming interface) in batches; the latest model at present is text-embedding-ada-002, which turns an article module into vector data.
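A hedged sketch of such a calling program: the batching helper is generic, and embed_modules shows the shape of a batch call to the 2023-era openai package's embedding endpoint. The call itself requires an installed package and an API key, is not executed here, and is an illustration rather than the exact production code:

```python
def batch(items, size):
    """Yield successive slices so each API request stays a manageable size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_modules(modules, batch_size=16):
    """Embed article modules in batches via the OpenAI embeddings API
    (illustrative; requires `pip install openai` and OPENAI_API_KEY)."""
    import openai
    vectors = []
    for chunk in batch(modules, batch_size):
        resp = openai.Embedding.create(
            model="text-embedding-ada-002", input=chunk)
        vectors.extend(item["embedding"] for item in resp["data"])
    return vectors
```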
In the present invention, in step 400, the vector data is stored in a vector database. Note that the original text blocks and the numeric vectors need to be stored together, so that the original text can be recovered in the reverse direction from a numeric vector.
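Storing each original text block together with its vector, so the text can be recovered from a vector match, might look like this in-memory stand-in for a vector database (the record layout is an assumption for illustration):

```python
class SimpleVectorStore:
    """Minimal stand-in for a vector database: each record keeps the
    numeric vector together with its original text block, so the text can
    be recovered in the reverse direction from the vector."""
    def __init__(self):
        self.records = []  # list of (vector, original_text) pairs

    def add(self, vector, text):
        self.records.append((tuple(vector), text))

    def text_for(self, vector):
        """Reverse lookup: return the original text for an exact vector."""
        for v, text in self.records:
            if v == tuple(vector):
                return text
        return None

store = SimpleVectorStore()
# Hypothetical chunk and vector, for illustration only
store.add([0.1, 0.9], "Article 3.1: model requirements ...")
```

A real vector database would match by similarity rather than exact equality, but the pairing of vector and source text is the point being illustrated.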
Note on ChatGPT: ChatGPT is a large language model developed by the OpenAI team. It accepts user input and generates a corresponding natural language response. The model is trained on a large text corpus and can be used for various natural language processing tasks such as language understanding, text generation and machine translation. ChatGPT is open and can be used by developers and researchers to build a variety of language applications and tools.
Embedding: an embedding is a vector, i.e. a list of numbers that a machine can understand. Mapping natural language into such vectors helps machines understand the meaning of words and the relationships between words.
Vector database (vector database): a vector database is a database that stores information as vectors or arrays of numbers. Each piece of information is represented as a vector, where each number in the vector corresponds to a particular attribute or feature of the data.
LlamaIndex: LlamaIndex (formerly GPT Index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data.
2. The invention provides a system for constructing a traffic and construction industry standard knowledge base by using ChatGPT, as shown in FIG. 2, which comprises an acquisition module 21, a processing module 22, a ChatGPT 23 and a vector database 24.
An acquisition module 21, configured to acquire industry standard data;
a processing module 22 for processing industry standard data into an article module;
ChatGPT 23, for processing the article module into vector data;
vector database 24 for storing vector data.
In this embodiment, the acquiring module 21 acquires industry standard data, where the industry standard data includes: national standards, industry standards, group standards and enterprise standards issued by the government, professional books, bidding requirements, various bidding documents, and regulatory documents covering the various processes of traffic construction engineering, construction safety and the like.
In this embodiment, the processing module 22 includes a preprocessing module and a splitting module. The preprocessing module is used for preprocessing the industry standard data into text data. The splitting module is used for splitting the text data into article modules.
The preprocessing module is used for preprocessing the industry standard data into text data, as follows:
Preprocessing refers to uniformly converting the data into a plain-text format while preserving the paragraph format; the industry standard data comprises text content, pictures and tables.
When processing a picture, the picture information needs to be described. This includes describing the specific content in the picture, the coloring of each of its modules and the scene range it applies to, creating a hash index for the picture, and storing the index with the picture description. Alternatively, the context of the picture can be analyzed and summarized through LlamaIndex, then divided into a content module, the module content tagged as 'picture content data', processed according to the flow described here, and stored in the vector database.
When processing a table, the table header and table content are converted into a txt document in CSV format. For standards containing many tables, such as the code dictionaries of various industries specified in national standards, a user who wants to obtain the corresponding code set through the industry standard can have codes generated automatically from the processed code knowledge base. For such data, we preferentially convert it into article sections consisting of table header and table content, name the stored files by table name and sequence number, add the same data content label (such as the pier coding standard of the national Industrial Internet channel coding standard 1.0) to each table file, and then build indexes for them separately in LlamaIndex, thereby obtaining the complete table data relationships. When a user needs this data for a relevant scenario, such as generating an Industrial Internet identification code, the system obtains a file list from the vector database according to the user's question, sends the data content to ChatGPT in batches, informs ChatGPT that the codes to be generated must follow the data in the table, and through several rounds of adjustment obtains the code information the user wants.
The splitting module is configured to split the text data into article modules, as follows:
The text data is split uniformly by paragraph; this comprises encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis of the text data. Because of the ChatGPT maximum-token requirement (no more than 8191 tokens per block, the input length constraint of the OpenAI embeddings model), the document content must be split, clipping it into multiple article modules (text chunks).
Redundant-character processing: repeated characters, and data in the document that does not belong to business rules, such as factual descriptions and cases, can be given lower storage priority or deleted outright.
Lexical analysis: article slices need to be richly labelled along the time, space, logic and catalogue dimensions. Time dimension: data carrying state meaning, time or temporal meaning must be extracted separately as reference tags for the content context of the segment; for example, current rules, expired rules and planned rules represent present limits, past validity and future validity respectively. Space dimension: data containing position information, area information, national administrative region information and the like needs a space-dimension tag. Logic dimension: content rules containing keywords such as 'must' and 'may' need a logic-dimension tag.
Catalogue dimension: a catalogue label is added to the catalogue result of the article module. The document's paragraph catalogue results must be saved with the directory structure from the root directory to the current node.
Directory tree analysis: check whether the current directory is a final leaf directory; if so, add an end flag; if not, add downstream node information such as next=current execution standard coding rule.
In this embodiment, the ChatGPT 23 is configured to process the article modules into vector data.
The ChatGPT 23 fine-tunes the article modules according to the model.
The article modules are learned and summarized, and converted into a number of vectors that carry summaries. During fine-tuning, content is de-duplicated and merged, and the data is read and summarized to form corresponding question sets. For OpenAI, determining the similarity of two pieces of text requires first turning each into a numeric vector (a vector embedding), something like a stack of coordinate values; comparing the vectors then yields a decimal between 0 and 1, and the closer it is to 1, the higher the similarity. The sent content and the returned results are merged according to the similarity result, and duplicates are removed.
The Embedding of the ChatGPT 23 processes the fine-tuned article modules into vector data.
A calling program is written in Python to call the OpenAI API (application programming interface) in batches; the latest model at present is text-embedding-ada-002, which turns an article module into vector data.
In this embodiment, the vector database 24 is used to store the vector data. Note that the original text blocks and the numeric vectors need to be stored together, so that the original text can be recovered in the reverse direction from a numeric vector.
3. An embodiment is provided to verify the method and system for constructing a traffic and construction industry standard knowledge base using ChatGPT, comprising the following steps.
1. The user asks a question: the system needs to configure a number of scene templates, and the user selects different scenes to ask corresponding questions.
2. User question vectorization: the question posed by the user is converted into a numeric vector through ChatGPT Embedding, which makes it convenient to query the vector database for the data of each vector module converted during data processing.
3. User question relevance query: after the user's question is converted into a vector, candidate answers are queried in the vector library and extracted according to relevance. Specifically, after the numeric vector converted from the user's question is obtained, it is searched in the vector database; a result set is returned and scored by matching similarity (the higher the score, the better the match), and the related results are returned in descending order of matching degree.
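The relevance query in step 3 — score every stored vector against the question vector and return results best match first — can be sketched as follows (records and k value are illustrative assumptions):

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; closer to 1 means a better match."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_matches(query_vec, records, k=3):
    """records: (vector, text) pairs from the vector database. Returns the
    k highest-scoring (score, text) pairs in descending order of score."""
    scored = [(cosine(query_vec, v), text) for v, text in records]
    scored.sort(key=lambda st: st[0], reverse=True)
    return scored[:k]

# Hypothetical stored chunks, for illustration only
records = [([1.0, 0.0], "model requirements"),
           ([0.0, 1.0], "numbering conventions"),
           ([0.9, 0.1], "general provisions")]
results = top_matches([1.0, 0.0], records, k=2)
```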
4. Article module merging: the user question and the preset answers are combined, and the result set is added to the prompt of the request sent to ChatGPT.
For example, the user asks: "I want to read Highway Engineering Information Model Application Standard (JTGT 2421-2021).pdf and summarize the model requirements; return the result in Xmind format." The text paragraphs retrieved from the database are: "What is the highway engineering design information model application standard? What are the model requirements? What are the general provisions of the model requirements? What are the numbering conventions required by the model?" The final prompt then looks like the following.
and (5) returning a result:
Plain Text
@startmindmap
* Model requirements
** General provisions
*** Information included in the information model
**** Geometric information
***** Geometric shape
***** Spatial position
**** Attribute information
***** Identification code
***** Classification code
***** Position
***** Size
***** Quantity
***** Type
***** Material
***** Amount used
** Numbering conventions
*** Numbering rules for bridges, culverts, tunnels, monitoring facilities and the like in the information model
@endmindmap
The returned result is saved in PlantUML format and can then optionally be imported as an Xmind file.
5. Large-model questioning: the question is put to the ChatGPT large model and the final answer is returned.
6. Fine-tuning the model through user questions: when the answer returned for a question is wrong or unsuitable, the user can give negative feedback together with the correct answer, point out where ChatGPT went wrong, and ask it to revise; ChatGPT then changes the original model data according to the user's answers, thereby gradually perfecting the knowledge base.

Claims (10)

1. A method for constructing a standard knowledge base of traffic and construction industry by using ChatGPT, which is characterized by comprising the following steps:
acquiring industry standard data;
processing the industry standard data into an article module;
the ChatGPT processes the article module into vector data;
the vector data is stored.
2. The method of claim 1, wherein the step of processing the industry standard data into an article module comprises the steps of:
preprocessing the industry standard data into text data;
and splitting the text data into the article modules.
3. The method of claim 2, wherein the step of preprocessing the industry standard data into text data comprises: describing the picture information.
4. The method of claim 2, wherein the step of splitting the text data into the article modules comprises: performing encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis on the text data.
5. The method of claim 1, wherein the step of the ChatGPT processing the article module into vector data comprises:
the ChatGPT fine-tuning the article module according to the model;
and the ChatGPT processing the fine-tuned article module into the vector data.
6. A system for constructing a traffic, building industry standard knowledge base using ChatGPT, comprising:
the acquisition module is used for acquiring industry standard data;
the processing module is used for processing the industry standard data into an article module;
the ChatGPT is used for processing the article module into vector data;
and the vector database is used for storing the vector data.
7. The system of claim 6, wherein the processing module comprises a preprocessing module and a splitting module;
the preprocessing module is used for preprocessing the industry standard data into text data;
and the splitting module is used for splitting the text data into the article modules.
8. The system of claim 7, wherein the preprocessing module is configured to describe the picture information.
9. The system of claim 7, wherein the splitting module is configured to perform encoding processing, redundant-character processing, segmentation into complete semantic segments, lexical analysis and directory tree analysis on the text data.
10. The system of claim 6, wherein the ChatGPT is configured to:
fine-tune the article module according to the model;
and process the fine-tuned article module into the vector data.
CN202311380162.8A 2023-10-23 2023-10-23 Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT Pending CN117370527A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311380162.8A CN117370527A (en) 2023-10-23 2023-10-23 Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311380162.8A CN117370527A (en) 2023-10-23 2023-10-23 Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT

Publications (1)

Publication Number Publication Date
CN117370527A true CN117370527A (en) 2024-01-09

Family

ID=89407378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311380162.8A Pending CN117370527A (en) 2023-10-23 2023-10-23 Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT

Country Status (1)

Country Link
CN (1) CN117370527A (en)

Similar Documents

Publication Publication Date Title
CN111753099B (en) Method and system for enhancing relevance of archive entity based on knowledge graph
US20220261427A1 (en) Methods and system for semantic search in large databases
CN110083805B (en) Method and system for converting Word file into EPUB file
US20170235841A1 (en) Enterprise search method and system
US6721451B1 (en) Apparatus and method for reading a document image
JP4343213B2 (en) Document processing apparatus and document processing method
CN112541490A (en) Archive image information structured construction method and device based on deep learning
CN109002499B (en) Discipline correlation knowledge point base construction method and system
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN112231494A (en) Information extraction method and device, electronic equipment and storage medium
JPWO2004034282A1 (en) Content reuse management device and content reuse support device
CN117095419A (en) PDF document data processing and information extracting device and method
EP2544100A2 (en) Method and system for making document modules
CN117370527A (en) Method and system for constructing traffic construction industry standard knowledge base by using ChatGPT
CN116756395A (en) Electronic archiving method and system for urban construction archives
CN115203445A (en) Multimedia resource searching method, device, equipment and medium
Zaslavsky et al. Using copy-detection and text comparison algorithms for cross-referencing multiple editions of literary works
CN114997167A (en) Resume content extraction method and device
JP2003288332A (en) Method and system for supporting structured document creation
CN113434760B (en) Construction method recommendation method, device, equipment and storage medium
Paskali et al. Six Steps Toward Improving Discoverability of Ph. D. Dissertations
CN117493712B (en) PDF document navigable directory extraction method and device, electronic equipment and storage medium
CN110457659B (en) Clause document generation method and terminal equipment
JPH07296005A (en) Japanese text registration/retrieval device
Monostori et al. Using the MatchDetectReveal system for comparative analysis of texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination