CN112257445A - Multi-modal tweet named entity recognition method based on text-picture relation pre-training - Google Patents
- Publication number: CN112257445A (application CN202011116968.2A)
- Authority: CN (China)
- Prior art keywords: text, visual, network, image, modal
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295—Named entity recognition
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a multi-modal tweet named entity recognition method based on text-image relationship pre-training, which comprises the following steps: step 1, large-scale data collection; step 2, establishing a pre-trained multi-modal network model based on relationship inference and visual attention (the RIVA model); and step 3, pre-training tasks. The beneficial effects of the invention are: the invention utilizes relational inference and visual attention to fuse multi-modal information more effectively, mitigating the negative impact that arises when a multi-modal model fuses unmatched visual and textual information. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large batches of readily obtainable unlabeled tweet data, generating a labeled data set, and then fine-tunes on a small manually labeled data set, which expands the data while improving the performance of the text-image classification network.
Description
Technical Field
The invention belongs to the field of tweet named entity recognition, and mainly relates to a pre-trained multi-modal network based on relationship inference and visual attention (RIVA), in which text-image relationship classification is carried out on a large unlabeled multi-modal corpus using a teacher-student semi-supervised paradigm.
Background
Social media such as Twitter have become part of many people's daily lives. They are an important data source for various applications, such as open-domain event extraction and social knowledge graphs, and named entity recognition on tweets is the first step of these tasks. Named Entity Recognition (NER) has achieved excellent performance on news articles. However, named entity recognition results on tweets remain unsatisfactory, because tweet messages are short and provide insufficient context for reasoning.
To overcome this problem, researchers have recently observed, from a multi-modal perspective, that visual information is inherently related to linguistic information. They then attempt to enhance the contextual information of the text, using attention mechanisms to correlate visual and textual information and obtain better reasoning results. Zhang et al., in "Adaptive Co-attention Network for Named Entity Recognition in Tweets" at the Thirty-Second AAAI Conference on Artificial Intelligence, designed an adaptive co-attention network layer that learns fused visual and linguistic feature vectors through a gated multi-modal fusion module; they also proposed a multi-modal tweet dataset, which we call the multi-modal tweet dataset of Fudan University. The visual language model of Zhang et al. is abbreviated ACN; the ACN adopts a filter gate to judge, for each feature, whether the fused features are beneficial to improving labeling precision. Lu et al., in the Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, proposed a visual attention model for finding image regions related to the text content, and also proposed a multi-modal tweet dataset, which we call the MNER Twitter dataset of Snap Research. The visual language model of Lu et al. is abbreviated VAM; it calculates attention weights for image regions from linear projections of the text query vector and the regional visual representations, and the authors give a series of visual attention examples. In a successful visual attention example, the entities of the text appear correspondingly in the image; in a failed visual attention example, the objects in the picture have no relation to the entities in the text. Most previous visual language model work is based on the assumption that images and texts are correlated, ignoring the situation in which the text may be unrelated to the image.
Vempala et al., in the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics ("Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts"), performed classification statistics on a tweet dataset according to whether the image augments the meaning of the tweet; they concluded that the text-image unrelated type accounts for approximately 56% of all text-image pairs. Hu et al. proposed Twitter100k in IEEE Transactions on Multimedia in 2017 ("Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval"), and after testing on this large unlabeled corpus we found that the proportion of unrelated text-image pairs can reach 60%, similar to the result found by Vempala et al. This confirms that the text and images in tweets are not always related; if unrelated text-image pairs are forcibly associated, erroneous information may be introduced, reducing the performance of the visual language model. Therefore, previous multi-modal fusion methods do not adequately address the negative effects that occur when text encounters irrelevant visual cues.
In summary, it is particularly important to provide a multi-modal tweet named entity recognition method based on text-image relationship pre-training.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-modal tweet named entity recognition method based on text-picture relation pre-training.
The method for recognizing the multi-modal tweet named entity based on text-picture relation pre-training comprises the following steps of:
step 1, large-scale data collection: the Twitter100k dataset is used as a large unlabeled multi-modal corpus; the image-text relationships in the Bloomberg text-image relationship dataset are merged into a text-image related relationship and a text-image unrelated relationship, and the Bloomberg text-image relationship dataset is divided into a training set and a test set at a fixed ratio; the multi-modal tweet dataset of Fudan University and the MNER Twitter dataset of Snap Research are selected as the data basis;
step 2, establishing a pre-trained multi-modal network model (RIVA model) for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises the following steps: a text-image Relational Gated Network (RGN), an attention-directed Visual Context Network (VCN), and a Visual Language Context Network (VLCN);
step 3, pre-training a task;
and step 4, applying the pre-trained multi-modal network model (RIVA model) to the multi-modal NER task: the pre-trained multi-modal model is tested using the biLSTM-CRF model as a baseline model for named entity recognition; the word embeddings e_k are input into the biLSTM network, and the Conditional Random Field (CRF) uses the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels; using the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks for each embedding in the Visual Language Context Network (VLCN) are concatenated into a visual-language context embedding; when performing the multi-modal NER task, the word embedding e_k is replaced by this visual-language context embedding.
Preferably, the step 2 specifically comprises the following steps:
step 2.1, establishing the text-image Relational Gating Network (RGN): text-image relationship classification is completed by a fully connected layer based on the fusion of language and visual features; the language features of the tweet are learned from a biLSTM (bidirectional LSTM) network;
step 2.1.1, the word embedding and the word's character embedding are concatenated as the input of the biLSTM network, and the forward output and backward output of the biLSTM network are concatenated as the encoded text vector f_t ∈ R^(1×d_t), where d_t is the dimension of the text vector f_t and 1×d_t is the size of the vector space to which the text vector f_t belongs;
step 2.1.2, the visual features f_v are extracted from the image using ResNet; based on the output size of the last convolutional layer in ResNet, average pooling is applied over a fixed region and the whole image is represented as a fixed-dimension vector f_v;
step 2.1.3, finally, the encoded text vector and the encoded image vector are multiplied element-wise, f_t ⊙ f_v, and then input into the FC layer and softmax layer to obtain the binary classification score and the visual context gating score s_G;
Step 2.2, establishing a Visual Context Network (VCN) of attention guidance;
step 2.2.1, let V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,m, j = 1,...,n, r_ij is a regional feature and d_v is its dimension; m × n × d_v is the output size of the last convolutional layer in ResNet, and m × n is the number of regions in the image;
step 2.2.2, the local visual features related to the language context are captured using scaled dot-product attention, which is defined as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (1)

In the above formula, matrix Q, matrix K and matrix V represent the query, key and value, respectively; d_k is the dimension of the key;
step 2.2.3, the language query vector Q_s = f_t is used as the query, and the regional visual features V_r are used as the keys and values; the language query vector Q_s and the regional visual features V_r are converted to the same dimension by linear projection, giving Q̃_s and Ṽ_r;
step 2.2.4, the language attention A = Attention(Q̃_s, Ṽ_r, Ṽ_r) is calculated, where Q_s is the language query vector, Q̃_s is the dimension-converted language query vector Q_s, and Ṽ_r is the dimension-converted regional visual features V_r; the single-head attention is then extended to multi-head attention, and the output of the local visual context V_c is defined as:

head_i = Attention(Q̃_s^(i), Ṽ_r^(i), Ṽ_r^(i))    (2)
V_c = [head_1; ...; head_h]    (3)

In the above formulas (2)–(3), Q_s is the language query vector, Q̃_s^(i) and Ṽ_r^(i) are the dimension-converted language query vector and regional visual features for the i-th head, V_c is the local visual context, head_i is the output of the i-th attention head, i = 1,...,h, and h is the total number of attention heads;
step 2.3, establishing the Visual Language Context Network (VLCN), and learning visual language context embeddings on the Twitter100k dataset using a biLSTM network;
step 2.3.1, first, a visual vector obtained by gating the local visual context with s_G is given, together with a sequence {w_t}, t = 1,...,T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence {w_t};
step 2.3.2, a forward LSTM network is used to predict w_t from (w_1,...,w_{t-1}); at time t = 0, the forward sequence input is the visual vector; meanwhile, a backward LSTM network is used to predict w_t from (w_{t+1},...,w_T); at time T + 1, the backward sequence input is the visual vector;
step 2.3.3, to align the word embeddings at the front and back, a word embedding [BOS] is added to the word sequence to indicate the start, and a word embedding [EOS] is added to indicate the end, so the sequence is expressed as ([BOS], w_1, ..., w_T, [EOS]); the visual features are substituted for [BOS] in forward prediction and for [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network.
Preferably, step 3 specifically comprises the following steps:
step 3.1, classifying the text-image relation;
step 3.1.1, text-image relationship classification is carried out using the Bloomberg text-image relationship dataset to determine whether the content of the image provides valid information beyond the text;
step 3.1.2, a teacher-student semi-supervised paradigm is adopted to enhance the text-image relationship classification task: first, a teacher model is trained on the Bloomberg text-image relationship dataset; then, the teacher model is used to predict the Twitter100k dataset, tweets with higher scores in the text-image related category are selected, and new pseudo-labeled training data are constructed, the pseudo-labeled training data together forming a pseudo-labeled 100K dataset; finally, in the training of the pre-trained multi-modal network model (RIVA model), the text-image Relational Gating Network (RGN) is taken as the student model, which is first trained with the pseudo-labeled training data and then fine-tuned on the Bloomberg text-image relationship dataset to reduce noisy labeling errors;
step 3.1.3, let x_i = <text_i, image_i> be a text-image tweet pair; the binary relation classification loss on the data in the Bloomberg text-image relationship dataset (Bloomberg) and the pseudo-labeled 100K dataset (pseudo-labeled100K) is computed by cross entropy:

L_Bloomberg = − Σ_{x_i ∈ Bloomberg} log p(x_i)    (4)
L_pseudo = − Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)    (5)

In the above formulas (4)–(5), L_Bloomberg is the binary relation classification loss on the Bloomberg text-image relationship dataset, Bloomberg denotes the Bloomberg text-image relationship dataset, L_pseudo is the binary relation classification loss on the pseudo-labeled 100K dataset, and pseudo-labeled100K denotes the pseudo-labeled 100K dataset; x_i is an image-text pair, and p(x_i) is the probability of the correct class, calculated by the softmax layer;
step 3.2, next word prediction: the Visual Language Context Network (VLCN) computes the probability of a sequence by modeling the probability of the next word in both the forward and backward directions; the probability of the sentence is:

p(w_1,...,w_T) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})    (6)

In the above formula, the probability p(w_t | ·) is calculated from the LSTM network followed by the FC layer and softmax layer, and the cross-entropy loss of the predicted word is calculated; the target loss in the forward and backward directions is minimized:

L_NWP = − Σ_{t=1}^{T} [ log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | w_{t+1},...,w_T, v) ]    (7)

In the above formula, L_NWP is the target loss in the forward and backward directions, {w_t}, t = 1,...,T is the sequence, T is the length of the sequence, and v is the visual vector.
Preferably, in step 1, the proportion of dividing the penbo text-image relation data set into a training set and a test set is 8: 2.
Preferably, in step 2, the text-image Relational Gating Network (RGN) performs classification based on the text-image relationship and outputs a relevance score s_G between the text and the image, which is used as a gating control on the path from the attention-directed Visual Context Network (VCN) to the Visual Language Context Network (VLCN); the attention-directed Visual Context Network (VCN) is a network based on visual-text attention for extracting local visual information related to the text, and its output is a visual context used as an input to the LSTM network to guide the learning of the Visual Language Context Network (VLCN); the Visual Language Context Network (VLCN) is a visual language model for performing the next word prediction (NWP) task.
Preferably, the fully-connected layer based on language and visual feature fusion in step 2.1 employs element-by-element multiplication in language and visual feature fusion.
Preferably, the FC layer in step 2.1.3 is a linear neural network.
Preferably, the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and Conditional Random Fields (CRFs).
Preferably, the teacher model in step 3.1.2 is an independent network, the structure of which is the same as a text-image Relational Gated Network (RGN).
The beneficial effects of the invention are: the invention utilizes relational inference and visual attention to fuse multi-modal information more effectively, mitigating the negative impact that arises when a multi-modal model fuses unmatched visual and textual information. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large batches of readily obtainable unlabeled tweet data, generating a labeled data set, and then fine-tunes on a small manually labeled data set, which expands the data while improving the performance of the text-image classification network.
Drawings
FIG. 1 is a diagram of a visual attention example of a VAM model;
FIG. 2 is a schematic diagram of the neural network structure of RIVA.
Detailed Description
The present invention will be further described with reference to the following examples. The following examples are set forth merely to aid in the understanding of the invention. It should be noted that, for a person skilled in the art, several modifications can be made to the invention without departing from its principle, and these modifications and variations also fall within the protection scope of the claims of the present invention.
In multi-modal named entity recognition, the main difficulty is how to merge into the text the visual information that benefits it, while excluding information that is invalid for the text, so as to generate high-quality multi-modal information. Centering on the image-text relationship, the invention starts from the two aspects of relational reasoning and visual attention and uses large-scale unsupervised data for semi-supervised learning, so as to complete the multi-modal named entity recognition task.
As shown in fig. 1, the text corresponding to fig. 1(a) mentions [Radiohead (band name)] performing new and old songs at its first concert in four years; this shows a successful visual attention example, and it can be seen that the textual entity appears correspondingly in the image. Fig. 1(b) corresponds to a text mentioning [Kevin Love] and [Kyle Korver] of [the Cleveland Cavaliers]; it shows a failed visual attention example, where the objects in the picture have no relation to the entities in the text.
Example 1:
a multi-modal tweet named entity recognition method based on text-picture relation pre-training comprises the following steps:
1. large-scale unlabeled and labeled dataset collection
1) Twitter100k dataset: the dataset was proposed by Hu et al. in 2017 and consists of 100,000 image-text pairs randomly captured from the Twitter platform. About 1/4 of the images in the Twitter100k dataset are highly correlated with their corresponding texts. Hu et al. studied weakly supervised learning for cross-media retrieval on this dataset. The invention uses this dataset as the large unlabeled multi-modal corpus and performs the image-text relationship matching and next word prediction tasks of the RIVA model.
2) Bloomberg text-image relationship dataset: the four image-text relationships are merged into two relationships, namely the text-image related relationship and the text-image unrelated relationship, i.e., a binary task between R1 ∪ R2 and R3 ∪ R4. The training set and test set are divided at a ratio of 8:2. The invention uses this dataset to train the teacher model in the semi-supervised learning of image-text relationship matching and to fine-tune the student model.
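The merge into a binary task and the 8:2 split can be sketched as follows (plain Python; the record tuples and IDs are hypothetical stand-ins, not the actual Bloomberg data format):

```python
import random

# Hypothetical records: (tweet_id, relation), one of the four Bloomberg-style
# relation codes R1-R4 (the real dataset format may differ).
records = [(i, random.Random(i).choice(["R1", "R2", "R3", "R4"])) for i in range(1000)]

# Merge the four relations into the binary task:
# R1 ∪ R2 ("text-image related") versus R3 ∪ R4 ("text-image unrelated").
binary = [(tid, 1 if rel in ("R1", "R2") else 0) for tid, rel in records]

# 8:2 train/test split.
random.Random(42).shuffle(binary)
cut = int(0.8 * len(binary))
train, test = binary[:cut], binary[cut:]
```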
3) The multi-modal tweet dataset of Fudan University divides named entity types into person, location, organization and others, labels 8,257 tweets using the BIO2 labeling scheme, and assigns 4,000, 1,000 and 3,257 items to the training, validation and test sets, respectively.
4) The MNER Twitter dataset of Snap Research labels 6,882 tweets using the BIO labeling scheme and assigns 4,817, 1,032 and 1,033 items to the training, validation and test sets, respectively.
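For illustration, the BIO/BIO2 scheme used by both datasets can be decoded as follows (the tagged tweet is made up, and `extract_entities` is an illustrative helper, not part of the invention):

```python
# A made-up tweet tagged with the BIO2 scheme: "B-" opens an entity,
# "I-" continues it, and "O" marks non-entity tokens.
tokens = ["Kevin", "Love", "joins", "Cleveland"]
tags   = ["B-PER", "I-PER", "O", "B-ORG"]

def extract_entities(tokens, tags):
    """Collect (type, surface form) spans from a BIO2-tagged sentence."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [tok])   # start a new entity span
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(tok)       # continue the open span
        else:
            current = None               # "O" closes any open span
    return [(etype, " ".join(words)) for etype, words in entities]
```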
2. Designing a pre-trained multi-modal network model (RIVA model):
1) text-image Relational Gated Network (RGN)
In the RGN, text-image relationship classification is done by a fully connected layer based on the fusion of language and visual features. The linguistic features of the tweet are learned from a bidirectional LSTM (biLSTM) network. The word embedding and the word's character embedding are concatenated and input into the biLSTM network, and the forward and backward outputs of the biLSTM network are concatenated as the encoded text vector f_t. The visual features f_v are extracted from the image using ResNet. The output size of the last convolutional layer in ResNet is 7 × 7 × d_v. Thus, average pooling is applied over the 7 × 7 region and the entire image is represented as a d_v-dimensional vector f_v; when ResNet-34 is used, d_v = 512. Finally, the element-wise product of the language and visual features, f_t ⊙ f_v, is input into the fully connected FC and softmax layers to obtain the binary classification score and the visual context gating score s_G.
2) Attention-directed Visual Context Network (VCN)
The output size of the last convolutional layer in ResNet is 7 × 7 × d_v, where 7 × 7 represents the 49 regions of the image. Let V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,7, j = 1,...,7 and r_ij is a d_v-dimensional regional feature. The local visual features related to the language context are captured using scaled dot-product attention, which is generally defined as follows:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (1)

In the above formula, matrix Q, matrix K and matrix V represent the query, key and value, respectively; d_k is the dimension of the key. The language query vector Q_s = f_t is used as the query, and the regional visual features V_r as the keys and values. Q_s and V_r are converted into the same dimension by linear projection, giving Q̃_s and Ṽ_r. The language attention A = Attention(Q̃_s, Ṽ_r, Ṽ_r) is calculated, where Q̃_s is the dimension-converted language query vector and Ṽ_r is the dimension-converted regional visual features; the single-head attention is then extended to multi-head attention. Finally, the output of the local visual context V_c is defined as:

head_i = Attention(Q̃_s^(i), Ṽ_r^(i), Ṽ_r^(i))    (2)
V_c = [head_1; ...; head_h]    (3)

In the above formulas (2)–(3), Q̃_s^(i) and Ṽ_r^(i) are the dimension-converted language query vector and regional visual features for the i-th head, V_c is the local visual context, head_i is the output of the i-th attention head, i = 1,...,h, and h is the total number of attention heads.
3) visual Language Context Network (VLCN).
Visual language context embeddings are learned on the large multi-modal tweet dataset Twitter100k using a biLSTM network. First, a visual vector obtained by gating the local visual context with s_G is given, together with a sequence {w_t}, t = 1,...,T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence. A forward LSTM network is used to predict w_t from (w_1,...,w_{t-1}); at time t = 0, the forward sequence input is the visual vector. Meanwhile, a backward LSTM network is used to predict w_t from (w_{t+1},...,w_T); at time T + 1, the backward sequence input is the visual vector. A word embedding [BOS] is added to the word sequence to indicate the start, and a word embedding [EOS] is added to indicate the end, so the sequence can be expressed as ([BOS], w_1, ..., w_T, [EOS]). The visual features replace [BOS] in forward prediction and [EOS] in backward prediction. The concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network, the same as the biLSTM network input in the RGN network.
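A toy sketch of how the visual vector stands in for [BOS]/[EOS] in the two directions (the `<VIS>` placeholder and the `lm_io` helper are illustrative, not from the patent):

```python
def lm_io(words, vis="<VIS>"):
    """Input/target pairs for the forward and backward LSTMs: the visual
    vector replaces [BOS] in the forward pass and [EOS] in the backward pass."""
    forward_inputs   = [vis] + list(words)        # predicts w_1 ... w_T, [EOS]
    forward_targets  = list(words) + ["[EOS]"]
    backward_inputs  = [vis] + list(words)[::-1]  # reads right-to-left
    backward_targets = list(words)[::-1] + ["[BOS]"]
    return forward_inputs, forward_targets, backward_inputs, backward_targets

fi, ft, bi, bt = lm_io(["rock", "concert", "tonight"])
```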
3. Pre-training tasks
Task 1: text-image relationship classification:
Text-image relationship classification is performed using the Bloomberg text-image relationship dataset to determine whether the content of the image provides valid information beyond the text. The text-image relationships and their statistics in the Bloomberg text-image relationship dataset are shown in Table 1 below.
Table 1: The four text-image relationships in the Bloomberg text-image relationship dataset

Text-image relationship | Image adds to the tweet's meaning | Text is represented in the image | Percent (%)
R1 | √ | √ | 18.5
R2 | √ | × | 25.6
R3 | × | √ | 21.9
R4 | × | × | 33.8

In the above Table 1, R1, R2, R3 and R4 are the codes of the text-image relationships.
To take advantage of the large amount of unlabeled multi-modal corpora, a teacher-student semi-supervised paradigm is adopted to enhance the text-image relationship classification task. First, a teacher model is trained on the Bloomberg text-image relationship dataset; the teacher model is an independent network whose structure is identical to the text-image Relational Gating Network (RGN). The teacher model is then used to predict the large unlabeled multi-modal corpus (the Twitter100k dataset). Tweets with higher scores in the text-image related category (category score > 0.6) are selected to construct new pseudo-labeled training data, denoted the pseudo-labeled 100K dataset (pseudo-labeled100K). Finally, in the training of the pre-trained multi-modal network model (RIVA model), the text-image Relational Gating Network (RGN) is taken as the student model; the student model is first trained with the data in the pseudo-labeled 100K dataset and then fine-tuned on the Bloomberg text-image relationship dataset to reduce noisy labeling errors. Let x_i = <text_i, image_i> be a text-image tweet pair; the binary relation classification loss on the data in the Bloomberg text-image relationship dataset (Bloomberg) and the pseudo-labeled 100K dataset (pseudo-labeled100K) is computed by cross entropy:

L_Bloomberg = − Σ_{x_i ∈ Bloomberg} log p(x_i)    (4)
L_pseudo = − Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)    (5)
in the above formulas (4) to (5),for binary relation classification loss in the penbo text-image relation dataset, Bloomberg for the penbo text-image relation dataset,for binary relation classification loss in the pseudo-label 100K dataset, pseudo-labeled100K is the pseudo-label 100K dataset; p (x)i) Probability of correct classification, calculated by softmax layer;
Task 2: next word prediction:

The Visual Language Context Network (VLCN) models the probability of a sequence via the probability of the next word in both the forward and backward directions. Given the visual vector v and a word sequence {w_t}, t = 1,...,T, the probability of the sentence is:

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})      (forward)

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_{t+1},...,w_T)      (backward)

The probability p(w_t | ·) in the above formulas is computed by the LSTM network followed by the fully connected layer FC and the softmax layer, and the cross-entropy loss of the predicted words is computed from it. The training task is therefore to minimize the target loss in the forward and backward directions:

L_LM = - Σ_{t=1}^{T} ( log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | v, w_{t+1},...,w_T) )

In the above formula, L_LM is the target loss in the forward and backward directions, {w_t}, t = 1,...,T is the word sequence, T is its length, and v is the visual vector;
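The two-direction factorization above can be sketched as follows (toy conditional probabilities stand in for the LSTM + softmax; all names are illustrative, and the visual vector is assumed to be folded into the first/last context):

```python
import math

def sentence_nll(words, prob_next, prob_prev):
    """Negative log-likelihood of a sentence summed over both directions.

    prob_next(context, w): probability of w given the preceding
    context (forward LSTM); prob_prev(context, w): probability of
    w given the following context (backward LSTM).
    """
    fwd = sum(-math.log(prob_next(tuple(words[:t]), words[t]))
              for t in range(len(words)))
    bwd = sum(-math.log(prob_prev(tuple(words[t + 1:]), words[t]))
              for t in range(len(words)))
    return fwd + bwd

# Toy model: every word is equally likely among a 4-word vocabulary,
# so each of the 3 words contributes log 4 per direction.
uniform = lambda context, w: 0.25
loss = sentence_nll(["a", "cat", "sat"], uniform, uniform)
```

Training would minimize this quantity over the corpus; here the uniform toy model yields 6·log 4.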
4. Using the pre-trained multi-modal network model (RIVA model) for the multi-modal NER task
The pre-trained multi-modal model was tested using the biLSTM-CRF model as the benchmark model for named entity recognition. The biLSTM-CRF model consists of a bidirectional LSTM network and a Conditional Random Field (CRF). The word embedding e_k is input into the biLSTM network, and the CRF uses the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels. With the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks in the Visual Language Context Network (VLCN) at each position are concatenated into a visual language context embedding; when performing the multi-modal NER task, the word embedding e_k is replaced with this visual language context embedding.
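A sketch of that substitution (pure Python lists stand in for vectors; the name `visual_language_embeddings` is illustrative): the forward and backward hidden states are concatenated per token, and the result replaces the original word embeddings.

```python
def visual_language_embeddings(fwd_hidden, bwd_hidden):
    """Concatenate per-token forward/backward LSTM hidden states
    into visual-language context embeddings, one per token."""
    return [f + b for f, b in zip(fwd_hidden, bwd_hidden)]

# Two tokens, hidden size 2 in each direction -> embeddings of size 4.
fwd = [[0.1, 0.2], [0.3, 0.4]]
bwd = [[0.5, 0.6], [0.7, 0.8]]
e_vl = visual_language_embeddings(fwd, bwd)
```

The `e_vl` sequence would then be fed to the biLSTM-CRF in place of the word embeddings e_k.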
Example 2:
1. parameter setting
100-dimensional GloVe word vectors were used in the pre-trained multi-modal network model (RIVA model), and 300-dimensional FastText Crawl word vectors were used in the biLSTM-CRF model. All images were resized to 224 × 224 to match the input of ResNet. Visual features were extracted using ResNet-34 and fine-tuned with a learning rate of 1e-4. The FC layer in Fig. 2 is a linear neural network followed by a GELU activation layer.
The models were trained on a computer equipped with an NVIDIA Tesla K80 GPU and an Intel Xeon Silver 4114 2.2 GHz CPU. On a single GPU, training the pre-trained multi-modal network model (RIVA model) takes about 32 hours, and the best results are reached at around 35 training epochs. Table 2 below shows the hyper-parameter values of the pre-trained multi-modal network model (RIVA model) and the biLSTM-CRF model.
Table 2 Hyper-parameter settings for the RIVA and biLSTM-CRF models
2. Performance testing of image-text relationship classification
Table 3 below shows the performance of the text-image Relationship Gated Network (RGN) on text-image relationship classification on the Bloomberg text-image relationship dataset. In terms of network architecture, Lu et al. represent multimodal features as the concatenation of linguistic and visual features, whereas the text-image Relationship Gated Network (RGN) of the present invention uses element-wise multiplication. The advantage of element-wise multiplication is that the parameter gradients in one modality are more strongly influenced by the data of the other modality, which enables collaborative learning over multimodal data. The F1 score of the text-image Relationship Gated Network (RGN) trained on the Bloomberg text-image relationship dataset is 4.7% higher than that of the method proposed by Lu et al. in 2018. After adding the data in the pseudo-labeled100K dataset, the performance of the text-image Relationship Gated Network (RGN) improved by a further 1.1%.
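The design choice can be illustrated with a one-dimensional sketch (illustrative only, not the patent's implementation): for element-wise multiplication, the gradient of the fused feature with respect to the text feature is the visual feature itself, so each modality's updates are scaled by the other; for concatenation, that gradient is a constant 1 regardless of the visual input.

```python
def fuse_multiply(t, v):
    """Element-wise (here scalar) multiplicative fusion of
    a text feature t and a visual feature v."""
    return t * v

def grad_wrt_text_multiply(t, v):
    # d(t*v)/dt = v: the text-side gradient is scaled by the
    # visual feature, coupling the two modalities.
    return v

def grad_wrt_text_concat(t, v):
    # For concatenation [t, v], d/dt of the t-component is 1,
    # independent of the visual feature.
    return 1.0

g_mul = grad_wrt_text_multiply(0.5, 2.0)  # depends on v
g_cat = grad_wrt_text_concat(0.5, 2.0)    # does not
```

This is why multiplicative fusion lets visual data shape the language-side parameters (and vice versa), while concatenation keeps the two gradient paths independent.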
Table 3 F1 scores of the text-image relationship network on the Bloomberg dataset
3. Results of pre-training multi-modal network model (RIVA model) performance tests
Table 4 below shows the performance improvement of the pre-trained multi-modal network model (RIVA model) over the biLSTM-CRF model. "biLSTM-CRF (text)" performs the NER task on the word embedding sequence alone. "biLSTM-CRF (text + picture)" adds a visual feature at the beginning of the embedding sequence to inform the biLSTM-CRF model of the image content. "biLSTM-CRF (text) + RIVA (text + picture)" means the text-image pair is input to the pre-trained multi-modal network model (RIVA model); "biLSTM-CRF (text) + RIVA (text)" means text is the only input to the pre-trained multi-modal network model (RIVA model), i.e. the text-image Relationship Gated Network (RGN) and the attention-directed Visual Context Network (VCN) are removed.
Table 4 F1 scores of biLSTM-CRF with and without the RIVA model
The results show that "biLSTM-CRF (text) + RIVA (text + picture)" improves over "biLSTM-CRF (text)" by 1.8% and 2.2% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. Regarding the effect of visual features, "biLSTM-CRF (text + picture)" improves the F1 score by 0.35% on average over "biLSTM-CRF (text)", while "biLSTM-CRF (text) + RIVA (text + picture)" improves by 1.45% on average over "biLSTM-CRF (text) + RIVA (text)". This indicates that the RIVA model better exploits visual features to enrich the contextual information of the tweet. In Table 5 below, "-" denotes no test result; the three visual language models, i.e. the pre-trained multi-modal network model (RIVA model), the ACN model and the VAM model, are compared in combination with the biLSTM-CRF model. The results show that performance is best when the pre-trained multi-modal network model (RIVA model) is attached.
Table 5 Comparison of F1 scores of biLSTM-CRF combined with RIVA, ACN and VAM
Visual language model | Fudan University multi-modal tweet dataset | Snap Research MNER Twitter dataset |
biLSTM-CRF+ACN | 70.7 | - |
biLSTM-CRF+VAM | - | 80.7 |
biLSTM-CRF+RIVA | 71.5 | 82.3 |
4. RGN ablation study
To test the role of the RGN network, the text-image Relationship Gated Network (RGN) was removed from the pre-trained multi-modal network model (RIVA model), and the output of the attention-directed Visual Context Network (VCN) was passed directly to the input of the biLSTM network of the Visual Language Context Network (VLCN), i.e. the correlation score s_G between text and image was fixed to 1.
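A sketch of what this ablation changes (names illustrative): the RGN score s_G scales the visual context before it enters the language model, so removing the RGN is equivalent to fixing s_G = 1 and letting text-unrelated images pass through at full strength.

```python
def gated_visual_input(s_g, visual_context):
    """Scale the visual context vector by the RGN relevance score s_g."""
    return [s_g * x for x in visual_context]

v_c = [0.4, -0.2, 0.8]
gated = gated_visual_input(0.1, v_c)    # image judged unrelated: damped
ablated = gated_visual_input(1.0, v_c)  # RGN removed: passed unchanged
```

The ablation results below reflect exactly this difference: performance on text-unrelated images degrades when the damping is removed.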
Table 6 below shows that after removing the RGN, the overall F1 score of the RIVA model drops by 0.7% and 0.9% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. In addition, the test data was divided into two groups, "image adds information" and "image adds no information", to compare the effect of removing the RGN on the different text-image relationship types. The performance on the "image adds information" group barely changed, but the performance on the "image adds no information" group dropped: specifically, the F1 score fell by 1.2% on the Fudan University multi-modal tweet dataset and by 1.5% on the Snap Research MNER Twitter dataset. This demonstrates that text-unrelated visual features can negatively impact the learning of visual language representations.
Table 6 F1 scores of the RIVA network before and after RGN removal
Experimental conclusions:
The invention addresses the visual feature attention problem that arises in multimodal learning on tweets when the image is unrelated to the text. The invention provides a multi-modal tweet named entity recognition method based on text-picture relationship pre-training, and identifies and mitigates the problem that text-unrelated visual features can bring negative returns in multimodal NER. Experiments show that, using a teacher-student semi-supervised training method under the multi-task framework of text-image relationship classification and next word prediction, the pre-trained multi-modal network model (RIVA model) performs the text-image relationship classification task well. The results show that when fusing multimodal information, the RIVA model outperforms visual language models such as ACN and VAM.
Claims (9)
1. A multi-modal tweet named entity recognition method based on text-picture relation pre-training is characterized by comprising the following steps of:
step 1, large-scale data collection: using the twitter100k dataset as an unlabeled multimodal corpus; merging the image-text relationships in the Bloomberg text-image relationship dataset into a text-image related class and a text-image unrelated class, and dividing the Bloomberg text-image relationship dataset into a training set and a test set at a fixed ratio; selecting the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset as the data base;
step 2, establishing a pre-trained multi-modal network model for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises the following steps: a text-image relational gating network, an attention-directed visual context network, and a visual language context network;
step 3, pre-training a task;
and step 4, applying the pre-trained multi-modal network model to the multi-modal NER task: testing the pre-trained multi-modal model using the biLSTM-CRF model as the reference model for named entity recognition; inputting the word embedding e_k into the biLSTM network, the conditional random field using the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels; using the pre-trained multi-modal network model, after the text-image pair is input, concatenating the hidden outputs of the forward LSTM network and the backward LSTM network in the visual language context network at each position into a visual language context embedding; and, when performing the multi-modal NER task, replacing the word embedding e_k with this visual language context embedding.
2. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the step 2 specifically comprises the following steps:
step 2.1, establishing a text-image relation gating network: completing text-image relation classification by using a full connection layer based on language and visual feature fusion; learning language features of the tweet from the biLSTM network;
step 2.1.1, inputting the concatenation of the word and the word's characters jointly into the biLSTM network, and concatenating the forward and backward outputs of the biLSTM network as the encoded text vector f_t ∈ R^(1×d_t), where d_t is the dimension of the text vector f_t and 1×d_t is the size of the vector space to which the text vector f_t belongs;
step 2.1.2, extracting the visual feature f_v from the image using ResNet; based on the output size of the last convolutional layer in ResNet, applying average pooling over fixed regions and representing the whole image as a fixed-dimension vector f_v;
step 2.1.3, finally, computing the element-wise product of the encoded text vector and the encoded image vector, f_t ⊙ f_v, and inputting it into the FC layer and the softmax layer to obtain the binary classification and the visual context gating score s_G;
Step 2.2, establishing an attention-oriented visual context network;
step 2.2.1, letting V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,m, j = 1,...,n, r_ij ∈ R^(d_v) is a regional feature, d_v is its dimension, m×n×d_v is the output size of the last convolutional layer in ResNet, and m×n is the number of regions in the image;
step 2.2.2, capturing the local visual features related to the language context using scaled dot-product attention, which is defined as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V      (1)

in the above formula, the matrices Q, K and V represent the query, key and value, respectively; d_k is the dimension of the key;
step 2.2.3, using the language query vector Q_s = f_t as the query and the regional visual features V_r as the keys and values; converting the language query vector Q_s and the regional visual features V_r to the same dimension by linear projection, obtaining Q_s' and V_r';
step 2.2.4, computing the language-guided attention head_i = Attention(Q_s', V_r', V_r'), where Q_s is the language query vector, Q_s' is the dimension-converted language query vector and V_r' is the dimension-converted regional visual features; and extending the single-head attention to multi-head attention; the output of the local visual context V_c is defined as:

head_i = Attention(Q_s', V_r', V_r')      (2)

V_c = Concat(head_1, ..., head_h)      (3)

in the above formulas (2) to (3), Q_s is the language query vector, Q_s' is the dimension-converted language query vector, V_r' is the dimension-converted regional visual features, V_c is the local visual context, head_i is the output of the i-th local visual context head, i = 1,...,h, and h is the total number of local visual context outputs;
step 2.3, establishing a visual language context network, and learning visual language context embedding on a twitter100k data set by using a biLSTM network;
step 2.3.1, first, giving the visual vector v = s_G ⊙ V_c and a sequence {w_t}, t = 1,...,T, of length T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence {w_t};
step 2.3.2, using the forward LSTM network to predict w_t from (w_1,...,w_{t-1}), where at time t = 0 the forward input is the visual vector v; and using the backward LSTM network to predict w_t from (w_{t+1},...,w_T), where at time T + 1 the backward input is the visual vector v;
step 2.3.3, adding the word embedding [BOS] at the start of the word sequence to indicate the beginning and the word embedding [EOS] at the end to indicate the end, the sequence being expressed as ([BOS], w_1, ..., w_T, [EOS]); substituting the visual features for [BOS] in forward prediction and for [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input to the LSTM network.
3. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the step 3 specifically comprises the following steps:
step 3.1, classifying the text-image relation;
step 3.1.1, carrying out text-image relationship classification using the Bloomberg text-image relationship dataset to determine whether the content of the image provides effective information beyond the text;
step 3.1.2, enhancing the text-image relationship classification task with a teacher-student semi-supervised scheme: firstly, training the teacher model on the Bloomberg text-image relationship dataset; then, predicting the twitter100k dataset with the teacher model, selecting the tweets with higher scores in the text-image related category, and constructing new pseudo-label training data, the pseudo-label training data forming the pseudo-labeled100K dataset; finally, in the training of the pre-trained multi-modal network model, taking the text-image relationship gating network as the student model, firstly training the student model with the pseudo-label training data, and reducing noisy-label errors by fine-tuning on the Bloomberg text-image relationship dataset;
step 3.1.3, letting x_i = <text_i, image_i> be a text-image tweet pair, computing the binary relation classification losses of the data in the Bloomberg text-image relationship dataset and the pseudo-labeled100K dataset by cross entropy:

L_Bloomberg = - Σ_{x_i ∈ Bloomberg} log p(x_i)      (4)

L_pseudo-labeled100K = - Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)      (5)
in the above formulas (4) to (5), L_Bloomberg is the binary relation classification loss on the Bloomberg text-image relationship dataset, Bloomberg denotes the Bloomberg text-image relationship dataset, L_pseudo-labeled100K is the binary relation classification loss on the pseudo-labeled100K dataset, and pseudo-labeled100K denotes the pseudo-labeled100K dataset; x_i is an image-text pair, and p(x_i) is the probability of the correct class, computed by the softmax layer;
and step 3.2, next word prediction: the visual language context network computes the probability of a sequence by modeling the probability of the next word in the forward and backward directions, the probability of the sentence being:

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})      (forward)

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_{t+1},...,w_T)      (backward)
in the above formulas, the probability p(w_t | ·) is computed by the LSTM network followed by the FC layer and the softmax layer, and the cross-entropy loss of the predicted words is computed from it; the target loss in the forward and backward directions is minimized:

L_LM = - Σ_{t=1}^{T} ( log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | v, w_{t+1},...,w_T) )
4. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: in step 1, the ratio for dividing the Bloomberg text-image relationship dataset into the training set and the test set is 8:2.
5. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: in step 2, the text-image relationship gating network performs classification based on the text-image relationship and outputs a correlation score between the text and the image, the correlation score serving as a gating control on the path from the attention-directed visual context network to the visual language context network; the attention-directed visual context network is a network based on visual-text attention for extracting the local visual information related to the text, and its output is the visual context, which serves as an input to the LSTM network to guide the learning of the visual language context network; and the visual language context network is a visual language model for performing the next word prediction task.
6. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 2, wherein: the fully connected layer based on the fusion of language and visual features in step 2.1 fuses the language and visual features by element-wise multiplication.
7. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 2, wherein: the FC layer in step 2.1.3 is a linear neural network.
8. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and conditional random fields.
9. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 3, wherein: the teacher model in step 3.1.2 is an independent network, the structure of which is the same as that of the text-image relationship gating network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116968.2A CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116968.2A CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112257445A true CN112257445A (en) | 2021-01-22 |
CN112257445B CN112257445B (en) | 2024-01-26 |
Family
ID=74244224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011116968.2A Active CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257445B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800259A (en) * | 2021-04-07 | 2021-05-14 | 武汉市真意境文化科技有限公司 | Image generation method and system based on edge closure and commonality detection |
CN113158875A (en) * | 2021-04-16 | 2021-07-23 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interactive fusion network |
CN113627172A (en) * | 2021-07-26 | 2021-11-09 | 重庆邮电大学 | Entity identification method and system based on multi-granularity feature fusion and uncertain denoising |
CN113657115A (en) * | 2021-07-21 | 2021-11-16 | 内蒙古工业大学 | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion |
CN113704502A (en) * | 2021-08-27 | 2021-11-26 | 电子科技大学 | Multi-mode information fusion account position identification method in social media |
CN113704547A (en) * | 2021-08-26 | 2021-11-26 | 合肥工业大学 | Multi-mode label recommendation method based on one-way supervision attention |
CN113806564A (en) * | 2021-09-22 | 2021-12-17 | 齐鲁工业大学 | Multi-mode informativeness tweet detection method and system |
CN114549850A (en) * | 2022-01-24 | 2022-05-27 | 西北大学 | Multi-modal image aesthetic quality evaluation method for solving modal loss problem |
CN114782739A (en) * | 2022-03-31 | 2022-07-22 | 电子科技大学 | Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer |
CN115080766A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
JP2022141587A (en) * | 2021-03-15 | 2022-09-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method and apparatus for acquiring pretraining model |
CN116341555A (en) * | 2023-05-26 | 2023-06-27 | 华东交通大学 | Named entity recognition method and system |
CN116842141A (en) * | 2023-08-28 | 2023-10-03 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
CN116561326B (en) * | 2023-07-10 | 2023-10-13 | 中国传媒大学 | Image text event extraction method, system and equipment based on label enhancement |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | A kind of image header automatic generation method based on multi-modal attention |
US20180365562A1 (en) * | 2017-06-20 | 2018-12-20 | Battelle Memorial Institute | Prediction of social media postings as trusted news or as types of suspicious news |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN111046668A (en) * | 2019-12-04 | 2020-04-21 | 北京信息科技大学 | Method and device for recognizing named entities of multi-modal cultural relic data |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
- 2020-10-19 CN CN202011116968.2A patent/CN112257445B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
US20180365562A1 (en) * | 2017-06-20 | 2018-12-20 | Battelle Memorial Institute | Prediction of social media postings as trusted news or as types of suspicious news |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | A kind of image header automatic generation method based on multi-modal attention |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN111046668A (en) * | 2019-12-04 | 2020-04-21 | 北京信息科技大学 | Method and device for recognizing named entities of multi-modal cultural relic data |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
Non-Patent Citations (1)
Title |
---|
谢腾;杨俊安;刘辉: "Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model" (in Chinese), 计算机***应用, no. 07 *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022141587A (en) * | 2021-03-15 | 2022-09-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method and apparatus for acquiring pretraining model |
CN112800259A (en) * | 2021-04-07 | 2021-05-14 | 武汉市真意境文化科技有限公司 | Image generation method and system based on edge closure and commonality detection |
CN113158875B (en) * | 2021-04-16 | 2022-07-01 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interaction fusion network |
CN113158875A (en) * | 2021-04-16 | 2021-07-23 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interactive fusion network |
CN113657115A (en) * | 2021-07-21 | 2021-11-16 | 内蒙古工业大学 | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion |
CN113657115B (en) * | 2021-07-21 | 2023-06-30 | 内蒙古工业大学 | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion |
CN113627172A (en) * | 2021-07-26 | 2021-11-09 | 重庆邮电大学 | Entity identification method and system based on multi-granularity feature fusion and uncertain denoising |
CN113704547A (en) * | 2021-08-26 | 2021-11-26 | 合肥工业大学 | Multi-mode label recommendation method based on one-way supervision attention |
CN113704547B (en) * | 2021-08-26 | 2024-02-13 | 合肥工业大学 | Multimode tag recommendation method based on unidirectional supervision attention |
CN113704502B (en) * | 2021-08-27 | 2023-04-21 | 电子科技大学 | Multi-mode information fusion account number position identification method based on social media |
CN113704502A (en) * | 2021-08-27 | 2021-11-26 | 电子科技大学 | Multi-mode information fusion account position identification method in social media |
CN113806564B (en) * | 2021-09-22 | 2024-05-10 | 齐鲁工业大学 | Multi-mode informative text detection method and system |
CN113806564A (en) * | 2021-09-22 | 2021-12-17 | 齐鲁工业大学 | Multi-mode informativeness tweet detection method and system |
CN114549850B (en) * | 2022-01-24 | 2023-08-08 | 西北大学 | Multi-mode image aesthetic quality evaluation method for solving modal missing problem |
CN114549850A (en) * | 2022-01-24 | 2022-05-27 | 西北大学 | Multi-modal image aesthetic quality evaluation method for solving modal loss problem |
CN114782739A (en) * | 2022-03-31 | 2022-07-22 | 电子科技大学 | Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer |
CN115080766B (en) * | 2022-08-16 | 2022-12-06 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
CN115080766A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
CN116341555A (en) * | 2023-05-26 | 2023-06-27 | 华东交通大学 | Named entity recognition method and system |
CN116341555B (en) * | 2023-05-26 | 2023-08-04 | 华东交通大学 | Named entity recognition method and system |
CN116561326B (en) * | 2023-07-10 | 2023-10-13 | 中国传媒大学 | Image text event extraction method, system and equipment based on label enhancement |
CN116842141A (en) * | 2023-08-28 | 2023-10-03 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
CN116842141B (en) * | 2023-08-28 | 2023-11-07 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
Also Published As
Publication number | Publication date |
---|---|
CN112257445B (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112257445A (en) | Multi-modal tweet named entity recognition method based on text-picture relation pre-training | |
Nooralahzadeh et al. | Progressive transformer-based generation of radiology reports | |
CN104615608B (en) | A kind of data mining processing system and method | |
CN109635280A (en) | A kind of event extraction method based on mark | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN109255027B (en) | E-commerce comment sentiment analysis noise reduction method and device | |
Thirumoorthy et al. | Feature selection using hybrid poor and rich optimization algorithm for text classification | |
CN109189862A (en) | A kind of construction of knowledge base method towards scientific and technological information analysis | |
CN110597979A (en) | Self-attention-based generating text summarization method | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN108256968A (en) | A kind of electric business platform commodity comment of experts generation method | |
CN111353306A (en) | Entity relationship and dependency Tree-LSTM-based combined event extraction method | |
CN110516098A (en) | Image labeling method based on convolutional neural networks and binary coding feature | |
CN112989806A (en) | Intelligent text error correction model training method | |
CN110909116A (en) | Entity set expansion method and system for social media | |
CN116127056A (en) | Medical dialogue abstracting method with multi-level characteristic enhancement | |
Liu et al. | UAMNer: uncertainty-aware multimodal named entity recognition in social media posts | |
CN113407697A (en) | Chinese medical question classification system for deep encyclopedia learning | |
He et al. | Syntax-aware entity representations for neural relation extraction | |
CN115422939A (en) | Fine-grained commodity named entity identification method based on big data | |
Peng et al. | Relation-aggregated cross-graph correlation learning for fine-grained image–text retrieval | |
Zhao et al. | Aligned visual semantic scene graph for image captioning | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
Bergamaschi et al. | Conditional random fields with semantic enhancement for named-entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |