CN117076693A - Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus

Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus

Info

Publication number
CN117076693A
CN117076693A
Authority
CN
China
Prior art keywords: data, corpus, text, feature, feature map
Legal status
Pending
Application number
CN202310843136.8A
Other languages
Chinese (zh)
Inventor
刘三女牙
周东波
曾超勇
李千千
姚璜
杨宗凯
Current Assignee
Central China Normal University
Original Assignee
Central China Normal University
Application filed by Central China Normal University
Priority to CN202310843136.8A
Publication of CN117076693A


Classifications

    • G06F16/41 — Information retrieval of multimedia data: indexing; data structures therefor; storage structures
    • G06F16/45 — Information retrieval of multimedia data: clustering; classification
    • G06F16/483 — Retrieval characterised by metadata automatically derived from the content
    • G06F18/22 — Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/253 — Pattern recognition: fusion techniques of extracted features
    • G06N3/0455 — Neural networks: auto-encoder networks; encoder-decoder networks
    • G06N3/0464 — Neural networks: convolutional networks [CNN, ConvNet]
    • G06N3/047 — Neural networks: probabilistic or stochastic networks
    • G06N3/048 — Neural networks: activation functions
    • G06N3/09 — Learning methods: supervised learning
    • G06N3/094 — Learning methods: adversarial learning
    • G06N3/096 — Learning methods: transfer learning
    • G06Q50/20 — ICT specially adapted for specific business sectors: education


Abstract

The application discloses a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps: 1) collect discipline-related multi-modal data from sources such as discipline-related documents, textbooks, curriculum materials, academic journals, and websites; 2) preprocess the collected raw corpus data; 3) perform feature extraction and representation learning on the multi-modal data based on deep learning models; 4) perform domain adaptation and fine-tuning of the pre-trained model; 5) organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application. By constructing the educational digital human discipline corpus automatically, the method improves construction efficiency, reduces labor cost, better meets the research and application needs of the field, and provides important technical support for the development and application of educational digital human disciplines.

Description

Method for constructing digital human teacher multi-modal large language model pre-training discipline corpus
Technical Field
The application relates to the technical field of artificial intelligence and can be applied to scenarios such as the metaverse and virtual digital humans; in particular, it relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus.
Background
In the educational field, the application of digitization technology has become a trend, including the development of educational digital human disciplines (Educational Digital Humanities). Educational digital human disciplines cover the digital analysis and study of learning, education, teaching methods, and resources, with the aim of improving education quality and learning outcomes. For research and application in this field, constructing a corpus specific to the field is essential. A corpus is a resource that gathers and organizes large amounts of text data for research and application use. However, finding suitable corpus data and building a complete corpus within educational digital human disciplines is a challenging task.
In the prior art, the construction of an educational digital human discipline corpus is often a time-consuming and labor-intensive process. Traditional methods involve manually collecting, labeling, and organizing corpus data, which requires a significant amount of time and expertise. In addition, due to the specificity of educational digital human disciplines, general corpus construction methods cannot meet the specific requirements of the field.
Disclosure of Invention
The application relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, and aims to provide an efficient, accurate, multi-modal corpus construction method that meets the discipline needs of digital human teachers in the education field. Through the acquisition and processing of multi-modal data and the application of a pre-trained model, the method constructs a discipline corpus suitable for digital human teachers. It can provide more accurate and comprehensive discipline corpus resources and powerful support for the learning, teaching, and research of digital human teachers, while reducing the cost and labor of corpus construction and improving the utilization efficiency and sustainable development of the corpus.
Specifically, the application relates to a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
Further, step (1), data collection, includes the following steps:
1.1 Collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources. The weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set. Let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}; the initial corpus data set is computed by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
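As one possible rendering of this weighting scheme, the Python sketch below computes W(s_i) and assembles D_total; the source names and relevance scores c_i are hypothetical, and since raw documents cannot literally be averaged, the "weighted average" is interpreted here as pairing each raw item with the weight of its source, which is one plausible reading rather than the application's fixed implementation.

```python
# Sketch of steps 1.1-1.2; relevance scores and source names are invented.
from typing import Dict, List, Tuple

def source_weights(relevance: Dict[str, float]) -> Dict[str, float]:
    # W(s_i) = c_i / sum_j c_j
    total = sum(relevance.values())
    return {s: c / total for s, c in relevance.items()}

def weighted_corpus(data: Dict[str, List[str]],
                    weights: Dict[str, float]) -> List[Tuple[str, float]]:
    # D_total: every raw item d_i paired with the weight W(s_i) of its source
    return [(item, weights[src]) for src, items in data.items() for item in items]

relevance = {"textbooks": 0.9, "course_images": 0.6, "lecture_audio": 0.5}
data = {"textbooks": ["chapter1.txt"], "course_images": ["fig1.png"],
        "lecture_audio": ["lec1.mp3"]}
d_total = weighted_corpus(data, source_weights(relevance))
print(d_total)  # [('chapter1.txt', 0.45), ('fig1.png', 0.3), ('lec1.mp3', 0.25)]
```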
Further, step (2), data preprocessing, covers text, image, and audio data and includes the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK (Natural Language Toolkit), traverse the words in the text, compare them against the list, and remove the matching stop words (e.g., "the", "is", "and"); the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer (PorterStemmer) in the NLTK library, traversing the words and reducing each to its base form;
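Steps 2.1.1 through 2.1.3 can be sketched with NLTK as follows; the sample sentence is invented, and this is a minimal illustration under the assumption that NLTK's English resources are used, not the application's exact implementation.

```python
# Minimal sketch of 2.1.1-2.1.3: punctuation removal, stop-word removal,
# and Porter stemming. The sample sentence is hypothetical.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

def preprocess_text(text: str) -> list[str]:
    text = re.sub(r"[^\w\s]", " ", text)           # 2.1.1 strip punctuation
    stop_words = set(stopwords.words("english"))   # 2.1.2 NLTK stop-word list
    stemmer = PorterStemmer()                      # 2.1.3 Porter stemmer
    return [stemmer.stem(w) for w in text.lower().split()
            if w not in stop_words]

print(preprocess_text("The digital teacher is explaining fractions."))
# -> ['digit', 'teacher', 'explain', 'fraction']
```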
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform operations such as contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
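A minimal sketch of steps 2.2.1 through 2.2.3 follows, using OpenCV for conversion and enhancement and the Ultralytics package as one common YOLO implementation (the application names YOLO but no specific library); the file names, CLAHE parameters, and the yolov8n model choice are assumptions.

```python
# Sketch of 2.2.1-2.2.3 with OpenCV and a pretrained YOLO detector.
import cv2
from ultralytics import YOLO  # one common YOLO implementation, assumed here

img = cv2.imread("page_scan.jpg")                  # hypothetical input image
cv2.imwrite("page_scan.png", img)                  # 2.2.1 unify to PNG

lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)         # 2.2.2 contrast enhancement
l, a, b = cv2.split(lab)
l = cv2.createCLAHE(clipLimit=2.0).apply(l)        # CLAHE on the L channel
enhanced = cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2BGR)
enhanced = cv2.GaussianBlur(enhanced, (3, 3), 0)   # light smoothing

model = YOLO("yolov8n.pt")                         # 2.2.3 pretrained detector
result = model(enhanced)[0]
for k, box in enumerate(result.boxes.xyxy):        # crop each detected region
    x1, y1, x2, y2 = map(int, box.tolist())
    cv2.imwrite(f"crop_{k}.png", enhanced[y1:y2, x1:x2])
```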
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
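Steps 2.3.1 through 2.3.3 might be implemented as sketched below, with pydub for format conversion, a 1-D Gaussian filter from SciPy for noise reduction, and librosa for Mel-spectrum features; these library choices, file names, and parameters are illustrative assumptions, since the application only fixes the MP3 target, the Gaussian filtering, and the Mel features.

```python
# Sketch of 2.3.1-2.3.3; file names and parameters are hypothetical.
import librosa
from pydub import AudioSegment
from scipy.ndimage import gaussian_filter1d

AudioSegment.from_file("lecture.wav").export("lecture.mp3", format="mp3")  # 2.3.1

y, sr = librosa.load("lecture.mp3", sr=16000)      # requires an MP3 backend
y = gaussian_filter1d(y, sigma=1.0)                # 2.3.2 Gaussian smoothing

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)   # 2.3.3
log_mel = librosa.power_to_db(mel)                 # (n_mels, frames) features
print(log_mel.shape)
```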
Further, the multi-modal feature extraction and representation learning in step (3) includes the following steps:
In multi-modal tasks, data of different modalities (text, image, and audio) carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks. The specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation. First, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text.
Further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
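The two formulas above can be sketched with Hugging Face transformers and PyTorch as follows; the bert-base-chinese checkpoint, the eight-head configuration, and the sample sentence are assumptions, and the extra attention layer is randomly initialized here, whereas in practice it would be trained.

```python
# Sketch of step 3.1: BERT embeddings followed by one extra multi-head
# self-attention layer. Checkpoint, head count, and input are assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
attention = torch.nn.MultiheadAttention(embed_dim=768, num_heads=8,
                                        batch_first=True)  # untrained here

inputs = tokenizer("勾股定理描述直角三角形三边的关系。", return_tensors="pt")
with torch.no_grad():
    text_embeddings = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    T, _ = attention(text_embeddings, text_embeddings,  # T = SA(text_embeddings)
                     text_embeddings)
print(T.shape)
```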
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images. ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure. An attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image: an attention module is added after the last convolutional layer of each basic block. Assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks.
After the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
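A minimal PyTorch sketch of steps 3.2.1 through 3.2.5 follows, implementing the per-position formulation given above (1×1 convolutions for Q, K, and V, a softmax over all H×W positions, position-wise weighting of V, and a 1×1 convolution back to C channels); the channel sizes and the residual connection back to the block input are assumptions beyond the text.

```python
# Sketch of the attention module of 3.2.1-3.2.5; sizes are illustrative.
import torch
import torch.nn as nn

class BasicBlockAttention(nn.Module):
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.q = nn.Conv2d(channels, reduced, kernel_size=1)    # query map Q
        self.k = nn.Conv2d(channels, reduced, kernel_size=1)    # key map K
        self.v = nn.Conv2d(channels, reduced, kernel_size=1)    # value map V
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)  # back to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.q(x), self.k(x), self.v(x)               # (B, C', H, W)
        s = (q * k).sum(dim=1, keepdim=True)                    # S(i,j) = Q(i,j)·K(i,j)
        a = torch.softmax(s.view(b, 1, -1), dim=-1).view(b, 1, h, w)  # A = softmax(S)
        i_map = a * v                                           # I(i,j) = A(i,j)·V(i,j)
        return x + self.out(i_map)  # residual connection is an assumption

attn = BasicBlockAttention(channels=64, reduced=32)
print(attn(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```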
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data. Denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension.
For the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter.
The convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
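A compact PyTorch sketch of this convolutional audio feature extractor follows; the layer counts, channel sizes, and pooling choices are illustrative, since the application fixes only the convolution/pooling/ReLU structure over Mel-spectrum input.

```python
# Sketch of step 3.3: a small CNN over a Mel-spectrogram. Sizes are assumed.
import torch
import torch.nn as nn

audio_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # C_i = f(X_Mel * w_i + b_i)
    nn.ReLU(),                                   # A_i = ReLU(C_i)
    nn.MaxPool2d(2),                             # dimensionality reduction
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                                # final feature representation A
)

# Input laid out as (batch, 1, mel bins, frames); the patent's X_Mel is
# (frames m, features n), transposed here to suit Conv2d.
x_mel = torch.randn(1, 1, 64, 200)
A = audio_cnn(x_mel)
print(A.shape)  # torch.Size([1, 32])
```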
Further, step (4), model domain adaptation and fine-tuning, includes the following steps:
4.1 Feature fusion
Fuse the features of the different modalities to establish a multi-modal feature representation. Let the text feature representation be T, the image feature representation be I, and the speech feature representation be A; feature fusion is performed with the following formula:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature denotes the fused multi-modal feature representation and Concatenate denotes concatenation of the text, image, and speech features;
4.2 Multi-task learning
Discipline tasks include text question answering and image question answering; multi-task learning uses the following formula:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the loss functions of the text, image, and audio question-answering tasks, where λ1, λ2, and λ3 are weight parameters;
4.3 Transfer learning
According to the characteristics and data distribution of the digital human teacher discipline tasks, the model is adapted to the domain on the labeled data of those tasks, using supervised training or adversarial training techniques, with the following formula:
Loss = Loss_pretrained + β · Loss_Specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task, where Loss denotes the overall loss used to balance model performance between pre-training and the specific discipline task, Loss_pretrained denotes the loss of the pre-trained model, Loss_Specific denotes the loss of the specific discipline task, and β is a weight parameter.
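Steps 4.1 through 4.3 reduce to one concatenation and two weighted loss sums, sketched below in PyTorch; the feature dimensions, the weight parameters λ1, λ2, λ3, and β, and the placeholder loss values are all hypothetical.

```python
# Sketch of 4.1-4.3; dimensions, weights, and loss values are invented.
import torch

T = torch.randn(1, 768)   # text features (3.1)
I = torch.randn(1, 64)    # image features (3.2)
A = torch.randn(1, 32)    # audio features (3.3)

fused_feature = torch.cat([T, I, A], dim=-1)      # Concatenate(T, I, A)

lam1, lam2, lam3, beta = 1.0, 0.5, 0.5, 0.3       # assumed weight parameters
loss_text, loss_image, loss_audio = (torch.tensor(v) for v in (0.9, 1.2, 0.7))

loss_multitask = lam1 * loss_text + lam2 * loss_image + lam3 * loss_audio  # 4.2
loss_pretrained, loss_specific = torch.tensor(0.4), torch.tensor(1.1)
loss_total = loss_pretrained + beta * loss_specific                        # 4.3
print(fused_feature.shape, loss_multitask.item(), loss_total.item())
```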
Further, step (5), corpus organization and management, includes the following steps:
5.1 Build an index:
Building an index is an important step in improving corpus retrieval efficiency; an index is a data structure used to quickly find and locate documents or corpus data within the corpus;
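One common index structure for this purpose is an inverted index mapping tokens to document identifiers, as in the minimal sketch below; the two sample documents are invented, and the application does not prescribe this particular structure.

```python
# Minimal inverted-index sketch for step 5.1: token -> set of document ids.
from collections import defaultdict

docs = {1: "pythagorean theorem right triangle",
        2: "triangle area formula"}

index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

print(sorted(index["triangle"]))  # [1, 2] -> both documents match
```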
5.2 Design a query interface:
To make it easy for users to search the corpus, a query interface is designed. The basic interface is keyword-based: the user enters keywords, and the system returns the related documents or corpus data containing them. A higher-level, semantics-based query interface is also designed, which uses natural language processing techniques to understand the user's query intent and provide more accurate results;
5.3 Implement the retrieval function:
Based on the designed query interface, implement the retrieval function of the corpus: documents or corpus data are quickly retrieved through the index and the query interface, and results matching the query conditions are returned;
5.4 Organize the corpus data:
According to the results and requirements of division and classification, organize the corpus data into a structured form, using a hierarchical structure, a directory structure, or a tagging system, so that users can browse and access it by discipline, topic, and difficulty level; at the same time, metadata can be recorded for the corpus data to facilitate management and retrieval;
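As one possible structured record for such organization, the sketch below defines a corpus entry carrying discipline, topic, difficulty, modality, and source metadata; every field name and value here is an illustrative assumption rather than a schema from the application.

```python
# Hypothetical structured corpus record for step 5.4.
from dataclasses import dataclass, field

@dataclass
class CorpusEntry:
    entry_id: str
    discipline: str      # e.g. "mathematics"
    topic: str           # e.g. "geometry/pythagorean-theorem"
    difficulty: str      # e.g. "grade-8"
    modality: str        # "text" | "image" | "audio"
    source: str          # originating data source s_i
    payload_path: str    # location of the preprocessed data
    metadata: dict = field(default_factory=dict)

entry = CorpusEntry("math-0001", "mathematics", "geometry/pythagorean-theorem",
                    "grade-8", "text", "textbooks", "corpus/math/0001.txt")
```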
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is an important step in keeping it timely and accurate; maintenance and update operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization based on user feedback and requirements.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described above.
The present application also provides a computer-readable storage medium storing a computer program for causing a computer to execute the steps of a method of constructing a digital human teacher multi-modal large language model pre-training discipline corpus as described above.
The beneficial technical effects of the application are as follows:
1. The method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described in this application builds the educational digital human discipline corpus automatically, which improves construction efficiency, reduces the cost and labor of corpus construction, and improves the utilization efficiency and sustainable development of the corpus.
2. The method can provide more accurate and comprehensive discipline corpus resources and powerful support for the learning, teaching, and research of digital human teachers.
3. The discipline corpus constructed by the method is specific to educational digital human disciplines, better meets the research and application needs of the field, and provides important technical support for the development and application of educational digital human disciplines.
Drawings
FIG. 1 is a flow chart of a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus in accordance with the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the technical solution of the application is described clearly and completely below with reference to FIG. 1.
The application provides a method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, which comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
Further, step (1), data collection, includes the following steps:
1.1 Collect discipline-related multi-modal data from various documents, textbooks, curriculum materials, academic journals, and websites. The data collection methods include web crawlers, database queries, text mining, and information extraction; the specific automated collection methods and techniques may differ according to actual requirements and the characteristics of the data sources. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources. The weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set. Let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}; the initial corpus data set is computed by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
Further, step (2), data preprocessing, covers text, image, and audio data and includes the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK (Natural Language Toolkit), traverse the words in the text, compare them against the list, and remove the matching stop words (e.g., "the", "is", "and"); the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer (PorterStemmer) in the NLTK library, traversing the words and reducing each to its base form;
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform operations such as contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
Further, the multi-modal feature extraction and representation learning in step (3) includes the following steps:
In multi-modal tasks, data of different modalities (text, image, and audio) carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks. The specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation. First, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text.
Further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images. ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure. An attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image: an attention module is added after the last convolutional layer of each basic block. Assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks.
After the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data. Denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension.
For the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter.
The convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
Further, step (4), model domain adaptation and fine-tuning, includes the following steps:
4.1 Feature fusion
Fuse the features of the different modalities to establish a multi-modal feature representation. Let the text feature representation be T, the image feature representation be I, and the speech feature representation be A; feature fusion is performed with the following formula:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature denotes the fused multi-modal feature representation and Concatenate denotes concatenation of the text, image, and speech features;
4.2 Multi-task learning
Discipline tasks include text question answering and image question answering; multi-task learning uses the following formula:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the loss functions of the text, image, and audio question-answering tasks, where λ1, λ2, and λ3 are weight parameters;
4.3 Transfer learning
According to the characteristics and data distribution of the digital human teacher discipline tasks, the model is adapted to the domain on the labeled data of those tasks, using supervised training or adversarial training techniques, with the following formula:
Loss = Loss_pretrained + β · Loss_Specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task, where Loss denotes the overall loss used to balance model performance between pre-training and the specific discipline task, Loss_pretrained denotes the loss of the pre-trained model, Loss_Specific denotes the loss of the specific discipline task, and β is a weight parameter.
Further, step (5), corpus organization and management, includes the following steps:
5.1 Build an index:
Building an index is an important step in improving corpus retrieval efficiency; an index is a data structure used to quickly find and locate documents or corpus data within the corpus;
5.2 Design a query interface:
To make it easy for users to search the corpus, a query interface is designed. The basic interface is keyword-based: the user enters keywords, and the system returns the related documents or corpus data containing them. A higher-level, semantics-based query interface is also designed, which uses natural language processing techniques to understand the user's query intent and provide more accurate results;
5.3 Implement the retrieval function:
Based on the designed query interface, implement the retrieval function of the corpus: documents or corpus data are quickly retrieved through the index and the query interface, and results matching the query conditions are returned;
5.4 Organize the corpus data:
According to the results and requirements of division and classification, organize the corpus data into a structured form, using a hierarchical structure, a directory structure, or a tagging system, so that users can browse and access it by discipline, topic, and difficulty level; at the same time, metadata can be recorded for the corpus data to facilitate management and retrieval;
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is an important step in keeping it timely and accurate; maintenance and update operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization based on user feedback and requirements.
The application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, performs the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus described above.
The present application also provides a computer-readable storage medium storing a computer program for causing a computer to execute the steps of a method of constructing a digital human teacher multi-modal large language model pre-training discipline corpus as described above.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. A method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus, characterized in that the method comprises the following steps:
(1) Data collection: collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites; these data sources cover the relevant fields and content of educational digital human disciplines;
(2) Data preprocessing: preprocess the collected data, i.e., process the text, image, and audio data in preparation for subsequent labeling and division work;
(3) Multi-modal feature extraction and representation learning: perform feature extraction and representation learning on the multi-modal data based on deep learning models;
(4) Model domain adaptation and fine-tuning: perform domain adaptation and fine-tuning of the pre-trained model according to the characteristics and requirements of the discipline, to improve the performance and adaptability of the model on digital human teacher discipline tasks;
(5) Corpus organization and management: organize the corpus data into a structured corpus according to the corpus division results, to facilitate subsequent corpus retrieval and application.
2. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the data collection of step (1) comprises the following steps:
1.1 Collect discipline-related multi-modal data, including text, images, and audio, from various discipline-related documents, textbooks, curriculum materials, academic journals, and websites. Let the collected text data sources be S_text = {s_1, ..., s_{n1}}, the image data sources be S_image = {s_{n1+1}, ..., s_{n1+n2}}, and the audio data sources be S_audio = {s_{n1+n2+1}, ..., s_n}, with relevance scores c_1, ..., c_n, where n = n1 + n2 + n3 is the total number of data sources, c_i is the relevance of data source s_i, j indexes all data sources, and Σ_j c_j is the sum of the relevance scores of all data sources; the weight of each source is calculated by the following formula:
W(s_i) = c_i / Σ_j c_j, where 1 ≤ i ≤ n and 1 ≤ j ≤ n
1.2 Taking the weight of each data source into account, fuse the raw corpus data from the different sources by weighted averaging to obtain an initial corpus data set; let the raw corpus data collected from the sources be D = {d_1, d_2, ..., d_n}, and compute the initial corpus data set by weighted averaging:
D_total = Σ_i W(s_i) · d_i, where 1 ≤ i ≤ n
where D_total denotes the final initial corpus data set.
3. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the data preprocessing of step (2) covers text, image, and audio data and comprises the following steps:
2.1 Text data preprocessing
2.1.1 Remove punctuation: remove punctuation marks from the text, or replace them with spaces, using a regular expression or a predefined punctuation list;
2.1.2 Remove stop words: using the stop-word list in the natural language processing library NLTK, traverse the words in the text, compare them against the list, and remove the matching stop words; the English stop-word list is obtained via stopwords.words('english') provided by NLTK, and removal is implemented with a list comprehension and a membership test, reducing the influence of stop words on analysis and modeling;
2.1.3 Stemming or lemmatization: process the words in the text with the stemmer in the NLTK library, traversing the words and reducing each to its base form;
2.2 Image data preprocessing
2.2.1 Format conversion: convert the image data to a uniform PNG format to maintain compatibility;
2.2.2 Image enhancement: apply image processing techniques to the collected images through the OpenCV library, mainly uniform contrast enhancement, color correction, and image smoothing, to improve image quality and the distinguishability of visual features;
2.2.3 Object detection and cropping: perform object detection and cropping with the object detection algorithm YOLO; using a trained YOLO model, detect objects by calling the corresponding functions, extract the position information of each object, and crop out specific objects or regions with functions from an image processing library according to that position information;
2.3 Audio data preprocessing
2.3.1 Format conversion: convert the audio data to a uniform MP3 format to ensure consistency and compatibility;
2.3.2 Noise reduction: apply a Gaussian filtering algorithm to reduce the effect of background noise on the speech signal;
2.3.3 Feature extraction: extract Mel-spectrum features from the audio for subsequent speech analysis and modeling;
Through the above steps, the preprocessed multi-modal data is obtained; denote the preprocessed text data as D'_text, the image data as D'_image, and the audio data as D'_audio.
4. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, characterized in that the multi-modal feature extraction and representation learning of step (3) includes the following steps:
In multi-modal tasks, data of different modalities carries rich information; extracting features from these data and learning representations of them captures their semantics and visual characteristics better and thus improves the performance of subsequent tasks; the specific steps are as follows:
3.1 Text data feature extraction and representation learning
An improved method is introduced for feature extraction and representation learning on text data, combining a pre-trained language model with a self-attention mechanism to obtain a more expressive text representation; first, an embedded representation of the text data is obtained with the pre-trained language model BERT, denoted as the text embedding matrix text_embeddings:
text_embeddings = BERT(D'_text)
The embedding matrix text_embeddings captures the semantic information and context of the text;
further feature extraction is then performed on the text embedding matrix using a self-attention mechanism, which allows each word in the text to interact with the other words and automatically learns importance weights between different words according to their relationships:
self_attention = MHA(text_embeddings)
T = SA(text_embeddings)
where self_attention denotes the output of the self-attention computation, MHA denotes the multi-head attention model, SA denotes the self-attention mechanism, and T denotes the text feature representation after self-attention processing;
3.2 Image data feature extraction and representation learning
After the image data is preprocessed, ResNet is adopted as the convolutional neural network model for feature extraction and representation learning on images; ResNet consists of multiple basic blocks, each containing a series of convolutional layers, batch normalization, and activation functions, used to build a deep network structure; an attention mechanism is introduced into the basic blocks of ResNet to capture the associations between different positions in an image, with an attention module added after the last convolutional layer of each basic block; assume the input feature map of a basic block is X, of size H × W × C;
3.2.1 Feature map computation
Query feature map Q: obtained by a convolution operation on the input feature map, of size H × W × C';
Key feature map K: obtained by a convolution operation on the input feature map, of size H × W × C';
Value feature map V: obtained by a convolution operation on the input feature map, of size H × W × C';
H and W denote the height and width of the feature map in the spatial dimensions, C denotes the number of channels of the input feature map, and C' denotes the number of channels of the new feature maps computed from it;
3.2.2 Similarity calculation
For each pixel position (i, j), compute the similarity score between the query feature Q(i, j) and the key feature K(i, j) using a dot product, yielding a similarity matrix S of size H × W:
S(i, j) = Q(i, j) · K(i, j)
3.2.3 Similarity normalization
Normalize the similarity matrix S, converting the similarity scores into attention weights A of size H × W using a softmax function:
A(i, j) = softmax(S(i, j))
3.2.4 Weighted fusion
Weight the value feature map V position-wise with the attention weights A to obtain the attention output feature map I of the basic block, of size H × W × C':
I(i, j) = A(i, j) · V(i, j)
3.2.5 Feature map transformation
The attention output feature map I is converted by a convolutional layer back to the same number of channels C as the input feature map X, so that it can connect with the subsequent basic blocks;
after the last convolutional layer of each basic block, the attention module thus applies adaptive feature weighting to the feature map through the similarity computation, normalization, and weighted fusion steps above, enhancing the representational capability of the basic block;
3.3 Audio data feature extraction and representation learning
In the convolutional neural network feature extraction process, the Mel-spectrum features or MFCC coefficients are taken as input; feature extraction and dimensionality reduction are performed through convolutional and pooling layers, the output of each convolutional layer is passed through an activation function, and the output of the last convolutional layer serves as the feature representation of the audio data; denote the Mel-spectrum features of the input audio data as X_Mel, of dimension (m, n), where m is the number of frames and n is the feature dimension;
for the i-th convolution kernel, the convolution operation can be expressed as:
C_i = f(X_Mel * w_i + b_i)
where C_i denotes the output feature map of the i-th convolution kernel, * denotes the convolution operation, w_i denotes the weight parameters of the i-th convolution kernel, and b_i denotes its bias parameter;
the convolution output is then passed through an activation function to enhance the feature expression capability:
A_i = ReLU(C_i)
The output A of the last convolutional layer is the feature representation of the audio data.
5. The method for constructing the digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, wherein the method comprises the following steps of: step (4) model domain adaptation and fine tuning includes the steps of:
4.1 Feature fusion
The features of the different modalities are fused into a multi-modal feature representation; with the text features denoted T, the image features I and the speech features A, the fusion is performed as:
Fused_Feature = Concatenate(T, I, A)
where Fused_Feature is the fused multi-modal feature representation and Concatenate denotes the concatenation of the text, image and speech features, as sketched below;
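Assuming T, I and A have already been reduced to per-sample feature vectors, the fusion is a single concatenation; the dimensions below are illustrative:

    import torch

    t = torch.randn(8, 256)    # text features T (batch of 8; dims assumed)
    i = torch.randn(8, 512)    # image features I
    a = torch.randn(8, 128)    # speech features A
    fused_feature = torch.cat([t, i, a], dim=1)   # Concatenate(T, I, A) -> (8, 896)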
4.2 Multi-task learning
The discipline tasks include text question answering and image question answering; multi-task learning uses the following loss:
Loss = λ1 · Loss_text + λ2 · Loss_image + λ3 · Loss_audio
which jointly considers the text, image and audio question-answering losses, with λ1, λ2 and λ3 as weight parameters; see the sketch below.
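The weighted objective can be written directly; the equal weights below are assumed, not prescribed by the method:

    import torch

    def multitask_loss(loss_text, loss_image, loss_audio,
                       lam1=1.0, lam2=1.0, lam3=1.0):
        # Loss = λ1·Loss_text + λ2·Loss_image + λ3·Loss_audio (step 4.2)
        return lam1 * loss_text + lam2 * loss_image + lam3 * loss_audio

    total = multitask_loss(torch.tensor(0.7), torch.tensor(1.2), torch.tensor(0.4))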
4.3 Transfer learning
The model is adapted to the characteristics and data distribution of the digital human teacher discipline tasks; using supervised training or adversarial training techniques, domain adaptation on the labelled data of the discipline tasks is driven by the following loss:
Loss = Loss_pretrained + β · Loss_specific
which jointly considers the loss of the pre-trained model and the loss of the specific discipline task: Loss is the total loss measuring model performance on both the pre-training and the specific discipline task, Loss_pretrained is the loss of the pre-trained model, Loss_specific the loss of the specific discipline task, and β a weight parameter; a sketch follows.
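The combined objective has the same shape as in step 4.2; β = 0.5 below is an assumed value:

    import torch

    def adaptation_loss(loss_pretrained, loss_specific, beta=0.5):
        # Loss = Loss_pretrained + β·Loss_specific (step 4.3)
        return loss_pretrained + beta * loss_specific

    total = adaptation_loss(torch.tensor(0.9), torch.tensor(1.5))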
6. The method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to claim 1, wherein step (5), corpus organization and management, comprises the following steps:
5.1 Establishing an index:
Establishing an index is a key step for improving the retrieval efficiency of the corpus; the index is a data structure used to quickly look up and locate documents or corpus data within the corpus, as sketched below;
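One common realisation of such an index is an inverted index; the sketch below uses whitespace tokenisation purely for illustration, whereas a real corpus would use a proper tokenizer:

    from collections import defaultdict

    def build_inverted_index(docs):
        # Step 5.1 sketch: map each token to the ids of the documents containing it.
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for token in text.lower().split():
                index[token].add(doc_id)
        return index

    index = build_inverted_index({1: "plane geometry triangle",
                                  2: "triangle inequality proof"})
    print(index["triangle"])   # {1, 2}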
5.2 Designing a query interface:
To make retrieval from the corpus convenient, a query interface is designed; the basic interface is keyword search, where the user enters keywords and the system returns the documents or corpus data containing them; a higher-level, semantics-based interface additionally uses natural language processing to understand the user's query intent and return more precise results;
5.3 Implementing the retrieval function:
Based on the designed query interface, the retrieval function of the corpus is implemented: documents or corpus data are located quickly through the index and the query interface, and the results matching the query conditions are returned, as in the keyword lookup sketched below;
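Building on the inverted-index sketch above, keyword retrieval with AND semantics (every query term must match) could look like this; the AND semantics are an assumption for the example:

    def search(index, query):
        # Step 5.3 sketch: return ids of documents containing every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        results = set(index.get(terms[0], set()))
        for term in terms[1:]:
            results &= index.get(term, set())   # intersect: all terms must match
        return results

    print(search(index, "triangle proof"))   # {2}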
5.4 Organizing the corpus data:
According to the results and requirements of the division and classification, the corpus data is organized into a structured form, using a hierarchical structure, a directory structure or a label system, so that users can browse and access it by discipline, topic and difficulty level; metadata is also recorded for each corpus item to ease management and retrieval, as in the example record below;
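As an illustration, a single corpus item with its metadata might be stored as the record below; every field name and value is hypothetical:

    corpus_item = {
        "id": "math-0001",                 # hypothetical identifier
        "discipline": "mathematics",       # browse by discipline
        "topic": "plane geometry",         # browse by topic
        "difficulty": "intermediate",      # browse by difficulty level
        "modality": "text",
        "content": "Prove that the base angles of an isosceles triangle are equal.",
        "metadata": {"source": "textbook", "grade": 8, "added": "2023-07-11"},
    }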
5.5 Maintenance and updating:
Regular maintenance and updating of the constructed corpus is essential for keeping it timely and accurate; maintenance and updating operations include adding new corpus data, correcting errors or updating stale data, and adjusting the classification, with continuous improvement and optimization carried out according to user feedback and requirements.
7. An electronic device, characterized by comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to any one of claims 1 to 6.
8. A computer-readable storage medium storing a computer program, characterized in that the computer program causes a computer to perform the steps of the method for constructing a digital human teacher multi-modal large language model pre-training discipline corpus according to any one of claims 1 to 6.
CN202310843136.8A 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus Pending CN117076693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310843136.8A CN117076693A (en) 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus

Publications (1)

Publication Number Publication Date
CN117076693A true CN117076693A (en) 2023-11-17

Family

ID=88718314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310843136.8A Pending CN117076693A (en) 2023-07-11 2023-07-11 Method for constructing digital human teacher multi-mode large language model pre-training discipline corpus

Country Status (1)

Country Link
CN (1) CN117076693A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349675A (en) * 2023-12-04 2024-01-05 环球数科集团有限公司 Multi-mode large model construction system for multiple information sources
CN117349675B (en) * 2023-12-04 2024-03-01 环球数科集团有限公司 Multi-mode large model construction system for multiple information sources
CN117453921A (en) * 2023-12-22 2024-01-26 南京华飞数据技术有限公司 Data information label processing method of large language model
CN117453921B (en) * 2023-12-22 2024-02-23 南京华飞数据技术有限公司 Data information label processing method of large language model
CN117875304A (en) * 2024-01-11 2024-04-12 西安西维迈创科技有限公司 Corpus construction method, system and storage medium for subway field
CN117808945A (en) * 2024-03-01 2024-04-02 北京烽火万家科技有限公司 Digital person generation system based on large-scale pre-training language model
CN117994101A (en) * 2024-04-03 2024-05-07 北京师范大学珠海校区 Teaching design generation method and device based on large language model
CN118170933A (en) * 2024-05-13 2024-06-11 之江实验室 Construction method and device of multi-mode corpus data oriented to scientific field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination