CN117076693A - Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus

Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus

Info

Publication number
CN117076693A
CN117076693A
Authority
CN
China
Prior art keywords: data, corpus, text, feature, feature map
Legal status
Pending
Application number
CN202310843136.8A
Other languages
Chinese (zh)
Inventor
刘三女牙
周东波
曾超勇
李千千
姚璜
杨宗凯
Current Assignee
Central China Normal University
Original Assignee
Central China Normal University
Application filed by Central China Normal University
Priority to CN202310843136.8A
Publication of CN117076693A


Classifications

    • G06F16/41 — Information retrieval of multimedia data: indexing; data structures therefor; storage structures
    • G06F16/45 — Information retrieval of multimedia data: clustering; classification
    • G06F16/483 — Retrieval characterised by metadata automatically derived from the content
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/253 — Pattern recognition: fusion techniques of extracted features
    • G06N3/0455 — Neural networks: auto-encoder networks; encoder-decoder networks
    • G06N3/0464 — Neural networks: convolutional networks [CNN, ConvNet]
    • G06N3/047 — Neural networks: probabilistic or stochastic networks
    • G06N3/048 — Neural networks: activation functions
    • G06N3/09 — Learning methods: supervised learning
    • G06N3/094 — Learning methods: adversarial learning
    • G06N3/096 — Learning methods: transfer learning
    • G06Q50/20 — ICT specially adapted for specific business sectors: education


Abstract

The application discloses a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps: 1) collect discipline-related multi-modal data from sources such as discipline-related documents, textbooks, curriculum materials, academic journals, and websites; 2) preprocess the collected raw corpus data; 3) perform feature extraction and representation learning on the multi-modal data based on deep learning models; 4) perform domain adaptation and fine-tuning of the pre-trained model; 5) organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application. By constructing the educational digital human discipline corpus automatically, the method improves construction efficiency, reduces labor cost, better meets the research and application needs of the field, and provides important technical support for the development and application of educational digital human disciplines.

Description

Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus
Technical Field
The application relates to the technical field of artificial intelligence and can be applied to scenarios such as the metaverse and virtual digital humans; in particular, it relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus.
Background
In the educational field, the application of digitization technology has become a trend, including the development of educational digital human disciplines (Educational Digital Humanities). Educational digital human disciplines cover the digital analysis and study of learning, education, teaching methods, and resources, with the aim of improving education quality and learning outcomes. For research and application in this field, constructing a corpus specific to the field is essential. A corpus is a resource that gathers and organizes large amounts of text data for research and application use. However, finding suitable corpus data and building a complete corpus within educational digital human disciplines is a challenging task.
In the prior art, the construction of an educational digital human discipline corpus is often a time-consuming and labor-intensive process. Traditional methods involve manually collecting, labeling, and organizing corpus data, which requires a significant amount of time and expertise. In addition, due to the specificity of educational digital human disciplines, general corpus construction methods cannot meet the specific requirements of the field.
Disclosure of Invention
The application relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, and aims to provide an efficient, accurate, multi-modal corpus construction method that meets the discipline needs of digital human teachers in the education field. Through the acquisition and processing of multi-modal data and the application of a pre-trained model, the method constructs a discipline corpus suitable for digital human teachers. It can provide more accurate and comprehensive discipline corpus resources and powerful support for the learning, teaching, and research of digital human teachers, while reducing the cost and labor of corpus construction and improving the utilization efficiency and sustainable development of the corpus.
Specifically, the application relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
Further, step (1), data collection, includes the following steps:
1.1 Collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources. The weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set. Let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}; the initial corpus data set is computed by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
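As one possible rendering of this weighting scheme, the Python sketch below computes W(s_i) and assembles D_total; the source names and relevance scores c_i are hypothetical, and since raw documents cannot literally be averaged, the "weighted average" is interpreted here as pairing each raw item with the weight of its source, which is one plausible reading rather than the application's fixed implementation.

```python
# Sketch of steps 1.1-1.2; relevance scores and source names are invented.
from typing import Dict, List, Tuple

def source_weights(relevance: Dict[str, float]) -> Dict[str, float]:
    # W(s_i) = c_i / sum_j c_j
    total = sum(relevance.values())
    return {s: c / total for s, c in relevance.items()}

def weighted_corpus(data: Dict[str, List[str]],
                    weights: Dict[str, float]) -> List[Tuple[str, float]]:
    # D_total: every raw item d_i paired with the weight W(s_i) of its source
    return [(item, weights[src]) for src, items in data.items() for item in items]

relevance = {"textbooks": 0.9, "course_images": 0.6, "lecture_audio": 0.5}
data = {"textbooks": ["chapter1.txt"], "course_images": ["fig1.png"],
        "lecture_audio": ["lec1.mp3"]}
d_total = weighted_corpus(data, source_weights(relevance))
print(d_total)  # [('chapter1.txt', 0.45), ('fig1.png', 0.3), ('lec1.mp3', 0.25)]
```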
Further, step (2), data preprocessing, covers text, image, and audio data and includes the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK (Natural Language Toolkit), traverse the words in the text, compare them against the list, and remove the matching stop words (e.g., "the", "is", "and"); the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer (PorterStemmer) in the NLTK library, traversing the words and reducing each to its base form;
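Steps 2.1.1 through 2.1.3 can be sketched with NLTK as follows; the sample sentence is invented, and this is a minimal illustration under the assumption that NLTK's English resources are used, not the application's exact implementation.

```python
# Minimal sketch of 2.1.1-2.1.3: punctuation removal, stop-word removal,
# and Porter stemming. The sample sentence is hypothetical.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def preprocess_text(text: str) -> list[str]:
    text = re.sub(r"[^\w\s]", " ", text)           # 2.1.1 strip punctuation
    stop_words = set(stopwords.words("english"))   # 2.1.2 NLTK stop-word list
    stemmer = PorterStemmer()                      # 2.1.3 Porter stemmer
    return [stemmer.stem(w) for w in text.lower().split()
            if w not in stop_words]

print(preprocess_text("The digital teacher is explaining fractions."))
# -> ['digit', 'teacher', 'explain', 'fraction']
```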
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform operations such as contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
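A minimal sketch of steps 2.2.1 through 2.2.3 follows, using OpenCV for conversion and enhancement and the Ultralytics package as one common YOLO implementation (the application names YOLO but no specific library); the file names, CLAHE parameters, and the yolov8n model choice are assumptions.

```python
# Sketch of 2.2.1-2.2.3 with OpenCV and a pretrained YOLO detector.
import cv2
from ultralytics import YOLO  # one common YOLO implementation, assumed here

img = cv2.imread("page_scan.jpg")                  # hypothetical input image
cv2.imwrite("page_scan.png", img)                  # 2.2.1 unify to PNG

lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)         # 2.2.2 contrast enhancement
l, a, b = cv2.split(lab)
l = cv2.createCLAHE(clipLimit=2.0).apply(l)        # CLAHE on the L channel
enhanced = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
enhanced = cv2.GaussianBlur(enhanced, (3, 3), 0)   # light smoothing

model = YOLO("yolov8n.pt")                         # 2.2.3 pretrained detector
result = model(enhanced)[0]
for k, box in enumerate(result.boxes.xyxy):        # crop each detected region
    x1, y1, x2, y2 = map(int, box.tolist())
    cv2.imwrite(f"crop_{k}.png", enhanced[y1:y2, x1:x2])
```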
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
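Steps 2.3.1 through 2.3.3 might be implemented as sketched below, with pydub for format conversion, a 1-D Gaussian filter from SciPy for noise reduction, and librosa for Mel-spectrum features; these library choices, file names, and parameters are illustrative assumptions, since the application only fixes the MP3 target, the Gaussian filtering, and the Mel features.

```python
# Sketch of 2.3.1-2.3.3; file names and parameters are hypothetical.
import librosa
from pydub import AudioSegment
from scipy.ndimage import gaussian_filter1d

AudioSegment.from_file("lecture.wav").export("lecture.mp3", format="mp3")  # 2.3.1

y, sr = librosa.load("lecture.mp3", sr=16000)      # requires an MP3 backend
y = gaussian_filter1d(y, sigma=1.0)                # 2.3.2 Gaussian smoothing

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)   # 2.3.3
log_mel = librosa.power_to_db(mel)                 # (n_mels, frames) features
print(log_mel.shape)
```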
Further, the multi-modal feature extraction and representation learning in step (3) includes the following steps:
In multi-modal tasks, data of different modalities (text, image, and audio) carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks. The specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation. First, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text.
Further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
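The two formulas above can be sketched with Hugging Face transformers and PyTorch as follows; the bert-base-chinese checkpoint, the eight-head configuration, and the sample sentence are assumptions, and the extra attention layer is randomly initialized here, whereas in practice it would be trained.

```python
# Sketch of step 3.1: BERT embeddings followed by one extra multi-head
# self-attention layer. Checkpoint, head count, and input are assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
attention = torch.nn.MultiheadAttention(embed_dim=768, num_heads=8,
                                        batch_first=True)  # untrained here

inputs = tokenizer("勾股定理描述直角三角形三边的关系。", return_tensors="pt")
with torch.no_grad():
    text_embeddings = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    T, _ = attention(text_embeddings, text_embeddings,  # T = SA(text_embeddings)
                     text_embeddings)
print(T.shape)
```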
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images. ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure. An attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image: an attention module is added after the last convolutional layer of each basic block. Assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks.
After the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
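A minimal PyTorch sketch of steps 3.2.1 through 3.2.5 follows, implementing the per-position formulation given above (1×1 convolutions for Q, K, and V, a softmax over all H×W positions, position-wise weighting of V, and a 1×1 convolution back to C channels); the channel sizes and the residual connection back to the block input are assumptions beyond the text.

```python
# Sketch of the attention module of 3.2.1-3.2.5; sizes are illustrative.
import torch
import torch.nn as nn

class BasicBlockAttention(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.q = nn.Conv2d(channels, reduced, kernel_size=1)    # query map Q
        self.k = nn.Conv2d(channels, reduced, kernel_size=1)    # key map K
        self.v = nn.Conv2d(channels, reduced, kernel_size=1)    # value map V
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)  # back to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)               # (B, C', H, W)
        s = (q * k).sum(dim=1, keepdim=True)                    # S(i,j) = Q(i,j)·K(i,j)
        a = torch.softmax(s.view(b, 1, -1), dim=-1).view(b, 1, h, w)  # A = softmax(S)
        i_map = a * v                                           # I(i,j) = A(i,j)·V(i,j)
        return x + self.out(i_map)  # residual connection is an assumption

attn = BasicBlockAttention(channels=64, reduced=32)
print(attn(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```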
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data. Denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension.
For the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter.
The convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
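A compact PyTorch sketch of this convolutional audio feature extractor follows; the layer counts, channel sizes, and pooling choices are illustrative, since the application fixes only the convolution/pooling/ReLU structure over Mel-spectrum input.

```python
# Sketch of step 3.3: a small CNN over a Mel-spectrogram. Sizes are assumed.
import torch
import torch.nn as nn

audio_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # C_i = f(X_Mel * w_i + b_i)
    nn.ReLU(),                                   # A_i = ReLU(C_i)
    nn.MaxPool2d(2),                             # dimensionality reduction
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                                # final feature representation A
)

# Input laid out as (batch, 1, mel bins, frames); the patent's X_Mel is
# (frames m, features n), transposed here to suit Conv2d.
x_mel = torch.randn(1, 1, 64, 200)
A = audio_cnn(x_mel)
print(A.shape)  # torch.Size([1, 32])
```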
Further, step (4), model domain adaptation and fine-tuning, includes the following steps:
4.1 Feature fusion
Fuse the features of the different modalities to establish a multi-modal feature representation. Let the text feature representation be T, the image feature representation be I, and the speech feature representation be A; feature fusion is performed with the following formula:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature denotes the fused multi-modal feature representation and Concatenate denotes concatenation of the text, image, and speech features;
4.2 Multi-task learning
Discipline tasks include text question answering and image question answering; multi-task learning uses the following formula:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the loss functions of the text, image, and audio question-answering tasks, where λ1, λ2, and λ3 are weight parameters;
4.3 Transfer learning
According to the characteristics and data distribution of the digital human teacher discipline tasks, the model is adapted to the domain on the labeled data of those tasks, using supervised training or adversarial training techniques, with the following formula:
Loss = Loss_pretrained + β · Loss_Specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task, where Loss denotes the overall loss used to balance model performance between pre-training and the specific discipline task, Loss_pretrained denotes the loss of the pre-trained model, Loss_Specific denotes the loss of the specific discipline task, and β is a weight parameter.
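Steps 4.1 through 4.3 reduce to one concatenation and two weighted loss sums, sketched below in PyTorch; the feature dimensions, the weight parameters λ1, λ2, λ3, and β, and the placeholder loss values are all hypothetical.

```python
# Sketch of 4.1-4.3; dimensions, weights, and loss values are invented.
import torch

T = torch.randn(1, 768)   # text features (3.1)
I = torch.randn(1, 64)    # image features (3.2)
A = torch.randn(1, 32)    # audio features (3.3)

fused_feature = torch.cat([T, I, A], dim=-1)      # Concatenate(T, I, A)

lam1, lam2, lam3, beta = 1.0, 0.5, 0.5, 0.3       # assumed weight parameters
loss_text, loss_image, loss_audio = (torch.tensor(v) for v in (0.9, 1.2, 0.7))

loss_multitask = lam1 * loss_text + lam2 * loss_image + lam3 * loss_audio  # 4.2
loss_pretrained, loss_specific = torch.tensor(0.4), torch.tensor(1.1)
loss_total = loss_pretrained + beta * loss_specific                        # 4.3
print(fused_feature.shape, loss_multitask.item(), loss_total.item())
```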
Further, step (5), corpus organization and management, includes the following steps:
5.1 Build an index:
Building an index is an important step in improving corpus retrieval efficiency; an index is a data structure used to quickly find and locate documents or corpus data within the corpus;
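One common index structure for this purpose is an inverted index mapping tokens to document identifiers, as in the minimal sketch below; the two sample documents are invented, and the application does not prescribe this particular structure.

```python
# Minimal inverted-index sketch for step 5.1: token -> set of document ids.
from collections import defaultdict

docs = {1: "pythagorean theorem right triangle",
        2: "triangle area formula"}

index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

print(sorted(index["triangle"]))  # [1, 2] -> both documents match
```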
5.2 Design a query interface:
To make it easy for users to search the corpus, a query interface is designed. The basic interface is keyword-based: the user enters keywords, and the system returns the related documents or corpus data containing them. A higher-level, semantics-based query interface is also designed, which uses natural language processing techniques to understand the user's query intent and provide more accurate results;
5.3 Implement the retrieval function:
Based on the designed query interface, implement the retrieval function of the corpus: documents or corpus data are quickly retrieved through the index and the query interface, and results matching the query conditions are returned;
5.4 Organize the corpus data:
According to the results and requirements of division and classification, organize the corpus data into a structured form, using a hierarchical structure, a directory structure, or a tagging system, so that users can browse and access it by discipline, topic, and difficulty level; at the same time, metadata can be recorded for the corpus data to facilitate management and retrieval;
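As one possible structured record for such organization, the sketch below defines a corpus entry carrying discipline, topic, difficulty, modality, and source metadata; every field name and value here is an illustrative assumption rather than a schema from the application.

```python
# Hypothetical structured corpus record for step 5.4.
from dataclasses import dataclass, field

@dataclass
class CorpusEntry:
    entry_id: str
    discipline: str      # e.g. "mathematics"
    topic: str           # e.g. "geometry/pythagorean-theorem"
    difficulty: str      # e.g. "grade-8"
    modality: str        # "text" | "image" | "audio"
    source: str          # originating data source s_i
    payload_path: str    # location of the preprocessed data
    metadata: dict = field(default_factory=dict)

entry = CorpusEntry("math-0001", "mathematics", "geometry/pythagorean-theorem",
                    "grade-8", "text", "textbooks", "corpus/math/0001.txt")
```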
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is an important step in keeping it timely and accurate; maintenance and update operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization based on user feedback and requirements.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described above.
The present application also provides a computer-readable storage medium storing a computer program for causing a computer to execute the steps of a method of constructing a digital human teacher multi-modal large language model pre-training discipline corpus as described above.
The beneficial technical effects of the application are as follows:
1. The method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described in this application builds the educational digital human discipline corpus automatically, which improves construction efficiency, reduces the cost and labor of corpus construction, and improves the utilization efficiency and sustainable development of the corpus.
2. The method can provide more accurate and comprehensive discipline corpus resources and powerful support for the learning, teaching, and research of digital human teachers.
3. The discipline corpus constructed by the method is specific to educational digital human disciplines, better meets the research and application needs of the field, and provides important technical support for the development and application of educational digital human disciplines.
Drawings
FIG. 1 is a flow chart of a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus in accordance with the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the technical solution of the application is described clearly and completely below with reference to FIG. 1.
The application provides a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
Further, step (1), data collection, includes the following steps:
1.1 Collect discipline-related multi-modal data from various documents, textbooks, curriculum materials, academic journals, and websites. The data collection methods include web crawlers, database queries, text mining, and information extraction; the specific automated collection methods and techniques may differ according to actual requirements and the characteristics of the data sources. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources. The weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set. Let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}; the initial corpus data set is computed by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
Further, step (2), data preprocessing, covers text, image, and audio data and includes the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK (Natural Language Toolkit), traverse the words in the text, compare them against the list, and remove the matching stop words (e.g., "the", "is", "and"); the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer (PorterStemmer) in the NLTK library, traversing the words and reducing each to its base form;
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform operations such as contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
Further, the multi-modal feature extraction and representation learning in step (3) includes the following steps:
In multi-modal tasks, data of different modalities (text, image, and audio) carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks. The specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation. First, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text.
Further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images. ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure. An attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image: an attention module is added after the last convolutional layer of each basic block. Assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks.
After the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data. Denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension.
For the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter.
The convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
Further, step (4), model domain adaptation and fine-tuning, includes the following steps:
4.1 Feature fusion
Fuse the features of the different modalities to establish a multi-modal feature representation. Let the text feature representation be T, the image feature representation be I, and the speech feature representation be A; feature fusion is performed with the following formula:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature denotes the fused multi-modal feature representation and Concatenate denotes concatenation of the text, image, and speech features;
4.2 Multi-task learning
Discipline tasks include text question answering and image question answering; multi-task learning uses the following formula:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the loss functions of the text, image, and audio question-answering tasks, where λ1, λ2, and λ3 are weight parameters;
4.3 Transfer learning
According to the characteristics and data distribution of the digital human teacher discipline tasks, the model is adapted to the domain on the labeled data of those tasks, using supervised training or adversarial training techniques, with the following formula:
Loss = Loss_pretrained + β · Loss_Specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task, where Loss denotes the overall loss used to balance model performance between pre-training and the specific discipline task, Loss_pretrained denotes the loss of the pre-trained model, Loss_Specific denotes the loss of the specific discipline task, and β is a weight parameter.
Further, step (5), corpus organization and management, includes the following steps:
5.1 Build an index:
Building an index is an important step in improving corpus retrieval efficiency; an index is a data structure used to quickly find and locate documents or corpus data within the corpus;
5.2 Design a query interface:
To make it easy for users to search the corpus, a query interface is designed. The basic interface is keyword-based: the user enters keywords, and the system returns the related documents or corpus data containing them. A higher-level, semantics-based query interface is also designed, which uses natural language processing techniques to understand the user's query intent and provide more accurate results;
5.3 Implement the retrieval function:
Based on the designed query interface, implement the retrieval function of the corpus: documents or corpus data are quickly retrieved through the index and the query interface, and results matching the query conditions are returned;
5.4 Organize the corpus data:
According to the results and requirements of division and classification, organize the corpus data into a structured form, using a hierarchical structure, a directory structure, or a tagging system, so that users can browse and access it by discipline, topic, and difficulty level; at the same time, metadata can be recorded for the corpus data to facilitate management and retrieval;
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is an important step in keeping it timely and accurate; maintenance and update operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization based on user feedback and requirements.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described above.
The present application also provides a computer-readable storage medium storing a computer program for causing a computer to execute the steps of a method of constructing a digital human teacher multi-modal large language model pre-training discipline corpus as described above.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. A method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, characterized in that the method comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
2. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the data collection of step (1) comprises the following steps:
1.1 Collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources; the weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set; let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}, and compute the initial corpus data set by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
3. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the data preprocessing of step (2) covers text, image, and audio data and comprises the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK, traverse the words in the text, compare them against the list, and remove the matching stop words; the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer in the NLTK library, traversing the words and reducing each to its base form;
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
4. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the multi-modal feature extraction and representation learning of step (3) includes the following steps:
In multi-modal tasks, data of different modalities carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks; the specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation; first, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text;
further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images; ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure; an attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image, with an attention module added after the last convolutional layer of each basic block; assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks;
after the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data; denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension;
for the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter;
the convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
5. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, wherein the method comprises the following steps of: step (4) model domain adaptation and fine tuning includes the steps of:
4.1 Feature fusion
The features of the different modalities are fused into a multi-modal feature representation; with the text features denoted T, the image features I and the speech features A, the fusion is performed as:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature is the fused multi-modal feature representation and Concatenate denotes the concatenation of the text, image and speech features, as sketched below;
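Assuming T, I and A have already been reduced to per-sample feature vectors, the fusion is a single concatenation; the dimensions below are illustrative:

    import torch

    t = torch.randn(8, 256)    # text features T (batch of 8; dims assumed)
    i = torch.randn(8, 512)    # image features I
    a = torch.randn(8, 128)    # speech features A
    fused_feature = torch.cat([t, i, a], dim=1)   # Concatenate(T, I, A) -> (8, 896)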
4.2 Multi-task learning
The discipline tasks include text question answering and image question answering; multi-task learning uses the following loss:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the text, image and audio question-answering losses, with λ1, λ2 and λ3 as weight parameters; see the sketch below.
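The weighted objective can be written directly; the equal weights below are assumed, not prescribed by the method:

    import torch

    def multitask_loss(loss_text, loss_image, loss_audio,
                       lam1=1.0, lam2=1.0, lam3=1.0):
        # Loss = λ1·Loss_text + λ2·Loss_image + λ3·Loss_audio (step 4.2)
        return lam1 * loss_text + lam2 * loss_image + lam3 * loss_audio

    total = multitask_loss(torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.4))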
4.3 Transfer learning
The model is adapted to the characteristics and data distribution of the digital human teacher discipline tasks; using supervised training or adversarial training techniques, domain adaptation on the labelled data of the discipline tasks is driven by the following loss:
Loss = Loss_pretrained + β · Loss_specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task: Loss is the total loss measuring model performance on both the pre-training and the specific discipline task, Loss_pretrained is the loss of the pre-trained model, Loss_specific the loss of the specific discipline task, and β a weight parameter; a sketch follows.
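The combined objective has the same shape as in step 4.2; β = 0.5 below is an assumed value:

    import torch

    def adaptation_loss(loss_pretrained, loss_specific, beta=0.5):
        # Loss = Loss_pretrained + β·Loss_specific (step 4.3)
        return loss_pretrained + beta * loss_specific

    total = adaptation_loss(torch.tensor(0.9), torch.tensor(1.5))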
6. The method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, wherein step (5), corpus organization and management, comprises the following steps:
5.1 Establishing an index:
Establishing an index is a key step for improving the retrieval efficiency of the corpus; the index is a data structure used to quickly look up and locate documents or corpus data within the corpus, as sketched below;
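One common realisation of such an index is an inverted index; the sketch below uses whitespace tokenisation purely for illustration, whereas a real corpus would use a proper tokenizer:

    from collections import defaultdict

    def build_inverted_index(docs):
        # Step 5.1 sketch: map each token to the ids of the documents containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        return index

    index = build_inverted_index({1: "plane geometry triangle",
                                  2: "triangle inequality proof"})
    print(index["triangle"])   # {1, 2}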
5.2 Designing a query interface:
To make retrieval from the corpus convenient, a query interface is designed; the basic interface is keyword search, where the user enters keywords and the system returns the documents or corpus data containing them; a higher-level, semantics-based interface additionally uses natural language processing to understand the user's query intent and return more precise results;
5.3 Implementing the retrieval function:
Based on the designed query interface, the retrieval function of the corpus is implemented: documents or corpus data are located quickly through the index and the query interface, and the results matching the query conditions are returned, as in the keyword lookup sketched below;
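Building on the inverted-index sketch above, keyword retrieval with AND semantics (every query term must match) could look like this; the AND semantics are an assumption for the example:

    def search(index, query):
        # Step 5.3 sketch: return ids of documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(index.get(terms[0], set()))
        for term in terms[1:]:
            results &= index.get(term, set())   # intersect: all terms must match
        return results

    print(search(index, "triangle proof"))   # {2}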
5.4 Organizing the corpus data:
According to the results and requirements of the division and classification, the corpus data is organized into a structured form, using a hierarchical structure, a directory structure or a label system, so that users can browse and access it by discipline, topic and difficulty level; metadata is also recorded for each corpus item to ease management and retrieval, as in the example record below;
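As an illustration, a single corpus item with its metadata might be stored as the record below; every field name and value is hypothetical:

    corpus_item = {
        "id": "math-0001",                 # hypothetical identifier
        "discipline": "mathematics",       # browse by discipline
        "topic": "plane geometry",         # browse by topic
        "difficulty": "intermediate",      # browse by difficulty level
        "modality": "text",
        "content": "Prove that the base angles of an isosceles triangle are equal.",
        "metadata": {"source": "textbook", "grade": 8, "added": "2023-07-11"},
    }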
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is essential for keeping it timely and accurate; maintenance and updating operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization carried out according to user feedback and requirements.
7. An electronic device, characterized by comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to any one of claims 1 to 6.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program causes a computer to perform the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to any one of claims 1 to 6.
CN202310843136.8A 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus Pending CN117076693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310843136.8A CN117076693A (en) 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus

Publications (1)

Publication Number Publication Date
CN117076693A true CN117076693A (en) 2023-11-17

Family

ID=88718314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310843136.8A Pending CN117076693A (en) 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus

Country Status (1)

Country Link
CN (1) CN117076693A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349675A (en) * 2023-12-04 2024-01-05 环球数科集团有限公司 Multi-mode large model construction system for multiple information sources
CN117349675B (en) * 2023-12-04 2024-03-01 环球数科集团有限公司 Multi-mode large model construction system for multiple information sources
CN117453921A (en) * 2023-12-22 2024-01-26 南京华飞数据技术有限公司 Data information label processing method of large language model
CN117453921B (en) * 2023-12-22 2024-02-23 南京华飞数据技术有限公司 Data information label processing method of large language model
CN117875304A (en) * 2024-01-11 2024-04-12 西安西维迈创科技有限公司 Corpus construction method, system and storage medium for subway field
CN117808945A (en) * 2024-03-01 2024-04-02 北京烽火万家科技有限公司 Digital person generation system based on large-scale pre-training language model
CN117994101A (en) * 2024-04-03 2024-05-07 北京师范大学珠海校区 Teaching design generation method and device based on large language model
CN118170933A (en) * 2024-05-13 2024-06-11 之江实验室 Construction method and device of multi-mode corpus data oriented to scientific field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination