CN112257445A - Multi-modal tweet named entity recognition method based on text-picture relation pre-training - Google Patents
- Publication number: CN112257445A (application CN202011116968.2A)
- Authority: CN (China)
- Prior art keywords: text, visual, network, image, modal
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295—Named entity recognition
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a multi-modal tweet named entity recognition method based on text-image relationship pre-training, which comprises the following steps: step 1, large-scale data collection; step 2, establishing a pre-trained multi-modal network model based on relationship inference and visual attention (the RIVA model); and step 3, pre-training tasks. The beneficial effects of the invention are: the invention utilizes relational inference and visual attention to fuse multi-modal information more effectively, mitigating the negative impact that arises when a multi-modal model fuses unmatched visual and textual information. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large batches of readily obtainable unlabeled tweet data, generating a labeled data set, and then fine-tunes on a small manually labeled data set, which expands the data while improving the performance of the text-image classification network.
Description
Technical Field
The invention belongs to the field of tweet named entity recognition, and mainly relates to a pre-trained multi-modal network based on relationship inference and visual attention (RIVA), in which text-image relationship classification is carried out on a large unlabeled multi-modal corpus using a teacher-student semi-supervised paradigm.
Background
Social media such as Twitter have become part of many people's daily lives. They are an important data source for various applications, such as open-domain event extraction and social knowledge graphs, and named entity recognition on tweets is the first step of these tasks. Named Entity Recognition (NER) has achieved excellent performance on news articles. However, named entity recognition results on tweets remain unsatisfactory, because tweet messages are short and provide insufficient context for reasoning.
To overcome this problem, researchers have recently observed, from a multi-modal perspective, that visual information is inherently related to linguistic information. They then attempt to enhance the contextual information of the text, using attention mechanisms to correlate visual and textual information and obtain better reasoning results. Zhang et al., in "Adaptive Co-attention Network for Named Entity Recognition in Tweets" at the Thirty-Second AAAI Conference on Artificial Intelligence, designed an adaptive co-attention network layer that learns fused visual and linguistic feature vectors through a gated multi-modal fusion module; they also proposed a multi-modal tweet dataset, which we call the multi-modal tweet dataset of Fudan University. The visual language model of Zhang et al. is abbreviated ACN; the ACN adopts a filter gate to judge, for each feature, whether the fused features are beneficial to improving labeling precision. Lu et al., in the Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, proposed a visual attention model for finding image regions related to the text content, and also proposed a multi-modal tweet dataset, which we call the MNER Twitter dataset of Snap Research. The visual language model of Lu et al. is abbreviated VAM; it calculates attention weights for image regions from linear projections of the text query vector and the regional visual representations, and the authors give a series of visual attention examples. In a successful visual attention example, the entities of the text appear correspondingly in the image; in a failed visual attention example, the objects in the picture have no relation to the entities in the text. Most previous visual language model work is based on the assumption that images and texts are correlated, ignoring the situation in which the text may be unrelated to the image.
Vempala et al., in the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics ("Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts"), performed classification statistics on a tweet dataset according to whether the image augments the meaning of the tweet; they concluded that the text-image unrelated type accounts for approximately 56% of all text-image pairs. Hu et al. proposed Twitter100k in IEEE Transactions on Multimedia in 2017 ("Twitter100k: A Real-World Dataset for Weakly Supervised Cross-Media Retrieval"), and after testing on this large unlabeled corpus we found that the proportion of unrelated text-image pairs can reach 60%, similar to the result found by Vempala et al. This confirms that the text and images in tweets are not always related; if unrelated text-image pairs are forcibly associated, erroneous information may be introduced, reducing the performance of the visual language model. Therefore, previous multi-modal fusion methods do not adequately address the negative effects that occur when text encounters irrelevant visual cues.
In summary, it is particularly important to provide a multi-modal tweet named entity recognition method based on text-image relationship pre-training.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-modal tweet named entity recognition method based on text-picture relation pre-training.
The method for recognizing the multi-modal tweet named entity based on text-picture relation pre-training comprises the following steps of:
step 1, large-scale data collection: the Twitter100k dataset is used as a large unlabeled multi-modal corpus; the image-text relationships in the Bloomberg text-image relationship dataset are merged into a text-image related relationship and a text-image unrelated relationship, and the Bloomberg text-image relationship dataset is divided into a training set and a test set at a fixed ratio; the multi-modal tweet dataset of Fudan University and the MNER Twitter dataset of Snap Research are selected as the data basis;
step 2, establishing a pre-trained multi-modal network model (RIVA model) for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises the following steps: a text-image Relational Gated Network (RGN), an attention-directed Visual Context Network (VCN), and a Visual Language Context Network (VLCN);
step 3, pre-training a task;
and step 4, applying the pre-trained multi-modal network model (RIVA model) to the multi-modal NER task: the pre-trained multi-modal model is tested using the biLSTM-CRF model as a baseline model for named entity recognition; the word embeddings e_k are input into the biLSTM network, and the Conditional Random Field (CRF) uses the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels; using the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks for each embedding in the Visual Language Context Network (VLCN) are concatenated into a visual-language context embedding; when performing the multi-modal NER task, the word embedding e_k is replaced by this visual-language context embedding.
Preferably, the step 2 specifically comprises the following steps:
step 2.1, establishing the text-image Relational Gating Network (RGN): text-image relationship classification is completed by a fully connected layer based on the fusion of language and visual features; the language features of the tweet are learned from a biLSTM (bidirectional LSTM) network;
step 2.1.1, the word embedding and the word's character embedding are concatenated as the input of the biLSTM network, and the forward output and backward output of the biLSTM network are concatenated as the encoded text vector f_t ∈ R^(1×d_t), where d_t is the dimension of the text vector f_t and 1×d_t is the size of the vector space to which the text vector f_t belongs;
step 2.1.2, the visual features f_v are extracted from the image using ResNet; based on the output size of the last convolutional layer in ResNet, average pooling is applied over a fixed region and the whole image is represented as a fixed-dimension vector f_v;
step 2.1.3, finally, the encoded text vector and the encoded image vector are multiplied element-wise, f_t ⊙ f_v, and then input into the FC layer and softmax layer to obtain the binary classification score and the visual context gating score s_G;
Step 2.2, establishing a Visual Context Network (VCN) of attention guidance;
step 2.2.1, let V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,m, j = 1,...,n, r_ij is a regional feature and d_v is its dimension; m × n × d_v is the output size of the last convolutional layer in ResNet, and m × n is the number of regions in the image;
step 2.2.2, the local visual features related to the language context are captured using scaled dot-product attention, which is defined as:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (1)

In the above formula, matrix Q, matrix K and matrix V represent the query, key and value, respectively; d_k is the dimension of the key;
step 2.2.3, the language query vector Q_s = f_t is used as the query, and the regional visual features V_r are used as the keys and values; the language query vector Q_s and the regional visual features V_r are converted to the same dimension by linear projection, giving Q̃_s and Ṽ_r;
step 2.2.4, the language attention A = Attention(Q̃_s, Ṽ_r, Ṽ_r) is calculated, where Q_s is the language query vector, Q̃_s is the dimension-converted language query vector Q_s, and Ṽ_r is the dimension-converted regional visual features V_r; the single-head attention is then extended to multi-head attention, and the output of the local visual context V_c is defined as:

head_i = Attention(Q̃_s^(i), Ṽ_r^(i), Ṽ_r^(i))    (2)
V_c = [head_1; ...; head_h]    (3)

In the above formulas (2)–(3), Q_s is the language query vector, Q̃_s^(i) and Ṽ_r^(i) are the dimension-converted language query vector and regional visual features for the i-th head, V_c is the local visual context, head_i is the output of the i-th attention head, i = 1,...,h, and h is the total number of attention heads;
step 2.3, establishing the Visual Language Context Network (VLCN), and learning visual language context embeddings on the Twitter100k dataset using a biLSTM network;
step 2.3.1, first, a visual vector obtained by gating the local visual context with s_G is given, together with a sequence {w_t}, t = 1,...,T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence {w_t};
step 2.3.2, a forward LSTM network is used to predict w_t from (w_1,...,w_{t-1}); at time t = 0, the forward sequence input is the visual vector; meanwhile, a backward LSTM network is used to predict w_t from (w_{t+1},...,w_T); at time T + 1, the backward sequence input is the visual vector;
step 2.3.3, to align the word embeddings at the front and back, a word embedding [BOS] is added to the word sequence to indicate the start, and a word embedding [EOS] is added to indicate the end, so the sequence is expressed as ([BOS], w_1, ..., w_T, [EOS]); the visual features are substituted for [BOS] in forward prediction and for [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network.
Preferably, step 3 specifically comprises the following steps:
step 3.1, classifying the text-image relation;
step 3.1.1, text-image relationship classification is carried out using the Bloomberg text-image relationship dataset to determine whether the content of the image provides valid information beyond the text;
step 3.1.2, a teacher-student semi-supervised paradigm is adopted to enhance the text-image relationship classification task: first, a teacher model is trained on the Bloomberg text-image relationship dataset; then, the teacher model is used to predict the Twitter100k dataset, tweets with higher scores in the text-image related category are selected, and new pseudo-labeled training data are constructed, the pseudo-labeled training data together forming a pseudo-labeled 100K dataset; finally, in the training of the pre-trained multi-modal network model (RIVA model), the text-image Relational Gating Network (RGN) is taken as the student model, which is first trained with the pseudo-labeled training data and then fine-tuned on the Bloomberg text-image relationship dataset to reduce noisy labeling errors;
step 3.1.3, let x_i = <text_i, image_i> be a text-image tweet pair; the binary relation classification loss on the data in the Bloomberg text-image relationship dataset (Bloomberg) and the pseudo-labeled 100K dataset (pseudo-labeled100K) is computed by cross entropy:

L_Bloomberg = − Σ_{x_i ∈ Bloomberg} log p(x_i)    (4)
L_pseudo = − Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)    (5)

In the above formulas (4)–(5), L_Bloomberg is the binary relation classification loss on the Bloomberg text-image relationship dataset, Bloomberg denotes the Bloomberg text-image relationship dataset, L_pseudo is the binary relation classification loss on the pseudo-labeled 100K dataset, and pseudo-labeled100K denotes the pseudo-labeled 100K dataset; x_i is an image-text pair, and p(x_i) is the probability of the correct class, calculated by the softmax layer;
step 3.2, next word prediction: the Visual Language Context Network (VLCN) computes the probability of a sequence by modeling the probability of the next word in both the forward and backward directions; the probability of the sentence is:

p(w_1,...,w_T) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})    (6)

In the above formula, the probability p(w_t | ·) is calculated from the LSTM network followed by the FC layer and softmax layer, and the cross-entropy loss of the predicted word is calculated; the target loss in the forward and backward directions is minimized:

L_NWP = − Σ_{t=1}^{T} [ log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | w_{t+1},...,w_T, v) ]    (7)

In the above formula, L_NWP is the target loss in the forward and backward directions, {w_t}, t = 1,...,T is the sequence, T is the length of the sequence, and v is the visual vector.
Preferably, in step 1, the proportion of dividing the penbo text-image relation data set into a training set and a test set is 8: 2.
Preferably, in step 2, the text-image Relational Gating Network (RGN) performs classification based on the text-image relationship and outputs a relevance score s_G between the text and the image, which is used as a gating control on the path from the attention-directed Visual Context Network (VCN) to the Visual Language Context Network (VLCN); the attention-directed Visual Context Network (VCN) is a network based on visual-text attention for extracting local visual information related to the text, and its output is a visual context used as an input to the LSTM network to guide the learning of the Visual Language Context Network (VLCN); the Visual Language Context Network (VLCN) is a visual language model for performing the next word prediction (NWP) task.
Preferably, the fully-connected layer based on language and visual feature fusion in step 2.1 employs element-by-element multiplication in language and visual feature fusion.
Preferably, the FC layer in step 2.1.3 is a linear neural network.
Preferably, the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and Conditional Random Fields (CRFs).
Preferably, the teacher model in step 3.1.2 is an independent network, the structure of which is the same as a text-image Relational Gated Network (RGN).
The beneficial effects of the invention are: the invention utilizes relational inference and visual attention to fuse multi-modal information more effectively, mitigating the negative impact that arises when a multi-modal model fuses unmatched visual and textual information. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large batches of readily obtainable unlabeled tweet data, generating a labeled data set, and then fine-tunes on a small manually labeled data set, which expands the data while improving the performance of the text-image classification network.
Drawings
FIG. 1 is a diagram of a visual attention example of a VAM model;
FIG. 2 is a schematic diagram of the neural network structure of RIVA.
Detailed Description
The present invention will be further described with reference to the following examples. The following examples are set forth merely to aid in the understanding of the invention. It should be noted that, for a person skilled in the art, several modifications can be made to the invention without departing from its principle, and these modifications and variations also fall within the protection scope of the claims of the present invention.
In multi-modal named entity recognition, the main difficulty is how to merge into the text the visual information that benefits it, while excluding information that is invalid for the text, so as to generate high-quality multi-modal information. Centering on the image-text relationship, the invention starts from the two aspects of relational reasoning and visual attention and uses large-scale unsupervised data for semi-supervised learning, so as to complete the multi-modal named entity recognition task.
As shown in fig. 1, the text corresponding to fig. 1(a) mentions [Radiohead (band name)] performing new and old songs at its first concert in four years; this shows a successful visual attention example, and it can be seen that the textual entity appears correspondingly in the image. Fig. 1(b) corresponds to a text mentioning [Kevin Love] and [Kyle Korver] of [the Cleveland Cavaliers]; it shows a failed visual attention example, where the objects in the picture have no relation to the entities in the text.
Example 1:
a multi-modal tweet named entity recognition method based on text-picture relation pre-training comprises the following steps:
1. large-scale unlabeled and labeled dataset collection
1) Twitter100k dataset: the dataset was proposed by Hu et al. in 2017 and consists of 100,000 image-text pairs randomly captured from the Twitter platform. About 1/4 of the images in the Twitter100k dataset are highly correlated with their corresponding texts. Hu et al. studied weakly supervised learning for cross-media retrieval on this dataset. The invention uses this dataset as the large unlabeled multi-modal corpus and performs the image-text relationship matching and next word prediction tasks of the RIVA model.
2) Bloomberg text-image relationship dataset: the four image-text relationships are merged into two relationships, namely the text-image related relationship and the text-image unrelated relationship, i.e., a binary task between R1 ∪ R2 and R3 ∪ R4. The training set and test set are divided at a ratio of 8:2. The invention uses this dataset to train the teacher model in the semi-supervised learning of image-text relationship matching and to fine-tune the student model.
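The merge into a binary task and the 8:2 split can be sketched as follows (plain Python; the record tuples and IDs are hypothetical stand-ins, not the actual Bloomberg data format):

```python
import random

# Hypothetical records: (tweet_id, relation), one of the four Bloomberg-style
# relation codes R1-R4 (the real dataset format may differ).
records = [(i, random.Random(i).choice(["R1", "R2", "R3", "R4"])) for i in range(1000)]

# Merge the four relations into the binary task:
# R1 ∪ R2 ("text-image related") versus R3 ∪ R4 ("text-image unrelated").
binary = [(tid, 1 if rel in ("R1", "R2") else 0) for tid, rel in records]

# 8:2 train/test split.
random.Random(42).shuffle(binary)
cut = int(0.8 * len(binary))
train, test = binary[:cut], binary[cut:]
```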
3) The multi-modal tweet dataset of Fudan University divides named entity types into person, location, organization and others, labels 8,257 tweets using the BIO2 labeling scheme, and assigns 4,000, 1,000 and 3,257 items to the training, validation and test sets, respectively.
4) The MNER Twitter dataset of Snap Research labels 6,882 tweets using the BIO labeling scheme and assigns 4,817, 1,032 and 1,033 items to the training, validation and test sets, respectively.
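For illustration, the BIO/BIO2 scheme used by both datasets can be decoded as follows (the tagged tweet is made up, and `extract_entities` is an illustrative helper, not part of the invention):

```python
# A made-up tweet tagged with the BIO2 scheme: "B-" opens an entity,
# "I-" continues it, and "O" marks non-entity tokens.
tokens = ["Kevin", "Love", "joins", "Cleveland"]
tags   = ["B-PER", "I-PER", "O", "B-ORG"]

def extract_entities(tokens, tags):
    """Collect (type, surface form) spans from a BIO2-tagged sentence."""
    entities, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = (tag[2:], [tok])   # start a new entity span
            entities.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(tok)       # continue the open span
        else:
            current = None               # "O" closes any open span
    return [(etype, " ".join(words)) for etype, words in entities]
```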
2. Designing a pre-trained multi-modal network model (RIVA model):
1) text-image Relational Gated Network (RGN)
In the RGN, text-image relationship classification is done by a fully connected layer based on the fusion of language and visual features. The linguistic features of the tweet are learned from a bidirectional LSTM (biLSTM) network. The word embedding and the word's character embedding are concatenated and input into the biLSTM network, and the forward and backward outputs of the biLSTM network are concatenated as the encoded text vector f_t. The visual features f_v are extracted from the image using ResNet. The output size of the last convolutional layer in ResNet is 7 × 7 × d_v. Thus, average pooling is applied over the 7 × 7 region and the entire image is represented as a d_v-dimensional vector f_v; when ResNet-34 is used, d_v = 512. Finally, the element-wise product of the language and visual features, f_t ⊙ f_v, is input into the fully connected FC and softmax layers to obtain the binary classification score and the visual context gating score s_G.
2) Attention-directed Visual Context Network (VCN)
The output size of the last convolutional layer in ResNet is 7 × 7 × d_v, where 7 × 7 represents the 49 regions of the image. Let V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,7, j = 1,...,7 and r_ij is a d_v-dimensional regional feature. The local visual features related to the language context are captured using scaled dot-product attention, which is generally defined as follows:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V    (1)

In the above formula, matrix Q, matrix K and matrix V represent the query, key and value, respectively; d_k is the dimension of the key. The language query vector Q_s = f_t is used as the query, and the regional visual features V_r as the keys and values. Q_s and V_r are converted into the same dimension by linear projection, giving Q̃_s and Ṽ_r. The language attention A = Attention(Q̃_s, Ṽ_r, Ṽ_r) is calculated, where Q̃_s is the dimension-converted language query vector and Ṽ_r is the dimension-converted regional visual features; the single-head attention is then extended to multi-head attention. Finally, the output of the local visual context V_c is defined as:

head_i = Attention(Q̃_s^(i), Ṽ_r^(i), Ṽ_r^(i))    (2)
V_c = [head_1; ...; head_h]    (3)

In the above formulas (2)–(3), Q̃_s^(i) and Ṽ_r^(i) are the dimension-converted language query vector and regional visual features for the i-th head, V_c is the local visual context, head_i is the output of the i-th attention head, i = 1,...,h, and h is the total number of attention heads.
3) visual Language Context Network (VLCN).
Visual language context embeddings are learned on the large multi-modal tweet dataset Twitter100k using a biLSTM network. First, a visual vector obtained by gating the local visual context with s_G is given, together with a sequence {w_t}, t = 1,...,T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence. A forward LSTM network is used to predict w_t from (w_1,...,w_{t-1}); at time t = 0, the forward sequence input is the visual vector. Meanwhile, a backward LSTM network is used to predict w_t from (w_{t+1},...,w_T); at time T + 1, the backward sequence input is the visual vector. A word embedding [BOS] is added to the word sequence to indicate the start, and a word embedding [EOS] is added to indicate the end, so the sequence can be expressed as ([BOS], w_1, ..., w_T, [EOS]). The visual features replace [BOS] in forward prediction and [EOS] in backward prediction. The concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network, the same as the biLSTM network input in the RGN network.
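A toy sketch of how the visual vector stands in for [BOS]/[EOS] in the two directions (the `<VIS>` placeholder and the `lm_io` helper are illustrative, not from the patent):

```python
def lm_io(words, vis="<VIS>"):
    """Input/target pairs for the forward and backward LSTMs: the visual
    vector replaces [BOS] in the forward pass and [EOS] in the backward pass."""
    forward_inputs   = [vis] + list(words)        # predicts w_1 ... w_T, [EOS]
    forward_targets  = list(words) + ["[EOS]"]
    backward_inputs  = [vis] + list(words)[::-1]  # reads right-to-left
    backward_targets = list(words)[::-1] + ["[BOS]"]
    return forward_inputs, forward_targets, backward_inputs, backward_targets

fi, ft, bi, bt = lm_io(["rock", "concert", "tonight"])
```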
3. Pre-training tasks
Task 1: text-image relationship classification:
Text-image relationship classification is performed using the Bloomberg text-image relationship dataset to determine whether the content of the image provides valid information beyond the text. The text-image relationships and their statistics in the Bloomberg text-image relationship dataset are shown in Table 1 below.
Table 1: The four text-image relationships in the Bloomberg text-image relationship dataset

Text-image relationship | Image adds to the tweet's meaning | Text is represented in the image | Percent (%)
R1 | √ | √ | 18.5
R2 | √ | × | 25.6
R3 | × | √ | 21.9
R4 | × | × | 33.8

In the above Table 1, R1, R2, R3 and R4 are the codes of the text-image relationships.
To take advantage of the large amount of unlabeled multi-modal corpora, a teacher-student semi-supervised paradigm is adopted to enhance the text-image relationship classification task. First, a teacher model is trained on the Bloomberg text-image relationship dataset; the teacher model is an independent network whose structure is identical to the text-image Relational Gating Network (RGN). The teacher model is then used to predict the large unlabeled multi-modal corpus (the Twitter100k dataset). Tweets with higher scores in the text-image related category (category score > 0.6) are selected to construct new pseudo-labeled training data, denoted the pseudo-labeled 100K dataset (pseudo-labeled100K). Finally, in the training of the pre-trained multi-modal network model (RIVA model), the text-image Relational Gating Network (RGN) is taken as the student model; the student model is first trained with the data in the pseudo-labeled 100K dataset and then fine-tuned on the Bloomberg text-image relationship dataset to reduce noisy labeling errors. Let x_i = <text_i, image_i> be a text-image tweet pair; the binary relation classification loss on the data in the Bloomberg text-image relationship dataset (Bloomberg) and the pseudo-labeled 100K dataset (pseudo-labeled100K) is computed by cross entropy:

L_Bloomberg = − Σ_{x_i ∈ Bloomberg} log p(x_i)    (4)
L_pseudo = − Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)    (5)
in the above formulas (4) to (5),for binary relation classification loss in the penbo text-image relation dataset, Bloomberg for the penbo text-image relation dataset,for binary relation classification loss in the pseudo-label 100K dataset, pseudo-labeled100K is the pseudo-label 100K dataset; p (x)i) Probability of correct classification, calculated by softmax layer;
Task 2: next word prediction:

The Visual Language Context Network (VLCN) models the probability of a sequence via the probability of the next word in both the forward and backward directions. Given the visual vector v and a word sequence {w_t}, t = 1,...,T, the probability of the sentence is:

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})      (forward)

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_{t+1},...,w_T)      (backward)

The probability p(w_t | ·) in the above formulas is computed by the LSTM network followed by the fully connected layer FC and the softmax layer, and the cross-entropy loss of the predicted words is computed from it. The training task is therefore to minimize the target loss in the forward and backward directions:

L_LM = - Σ_{t=1}^{T} ( log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | v, w_{t+1},...,w_T) )

In the above formula, L_LM is the target loss in the forward and backward directions, {w_t}, t = 1,...,T is the word sequence, T is its length, and v is the visual vector;
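The two-direction factorization above can be sketched as follows (toy conditional probabilities stand in for the LSTM + softmax; all names are illustrative, and the visual vector is assumed to be folded into the first/last context):

```python
import math

def sentence_nll(words, prob_next, prob_prev):
    """Negative log-likelihood of a sentence summed over both directions.

    prob_next(context, w): probability of w given the preceding
    context (forward LSTM); prob_prev(context, w): probability of
    w given the following context (backward LSTM).
    """
    fwd = sum(-math.log(prob_next(tuple(words[:t]), words[t]))
              for t in range(len(words)))
    bwd = sum(-math.log(prob_prev(tuple(words[t + 1:]), words[t]))
              for t in range(len(words)))
    return fwd + bwd

# Toy model: every word is equally likely among a 4-word vocabulary,
# so each of the 3 words contributes log 4 per direction.
uniform = lambda context, w: 0.25
loss = sentence_nll(["a", "cat", "sat"], uniform, uniform)
```

Training would minimize this quantity over the corpus; here the uniform toy model yields 6·log 4.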
4. Using the pre-trained multi-modal network model (RIVA model) for the multi-modal NER task
The pre-trained multi-modal model was tested using the biLSTM-CRF model as the benchmark model for named entity recognition. The biLSTM-CRF model consists of a bidirectional LSTM network and a Conditional Random Field (CRF). The word embedding e_k is input into the biLSTM network, and the CRF uses the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels. With the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks in the Visual Language Context Network (VLCN) at each position are concatenated into a visual language context embedding; when performing the multi-modal NER task, the word embedding e_k is replaced with this visual language context embedding.
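A sketch of that substitution (pure Python lists stand in for vectors; the name `visual_language_embeddings` is illustrative): the forward and backward hidden states are concatenated per token, and the result replaces the original word embeddings.

```python
def visual_language_embeddings(fwd_hidden, bwd_hidden):
    """Concatenate per-token forward/backward LSTM hidden states
    into visual-language context embeddings, one per token."""
    return [f + b for f, b in zip(fwd_hidden, bwd_hidden)]

# Two tokens, hidden size 2 in each direction -> embeddings of size 4.
fwd = [[0.1, 0.2], [0.3, 0.4]]
bwd = [[0.5, 0.6], [0.7, 0.8]]
e_vl = visual_language_embeddings(fwd, bwd)
```

The `e_vl` sequence would then be fed to the biLSTM-CRF in place of the word embeddings e_k.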
Example 2:
1. parameter setting
100-dimensional GloVe word vectors were used in the pre-trained multi-modal network model (RIVA model), and 300-dimensional FastText Crawl word vectors were used in the biLSTM-CRF model. All images were resized to 224 × 224 to match the input of ResNet. Visual features were extracted using ResNet-34 and fine-tuned with a learning rate of 1e-4. The FC layer in Fig. 2 is a linear neural network followed by a GELU activation layer.
The models were trained on a computer equipped with an NVIDIA Tesla K80 GPU and an Intel Xeon Silver 4114 2.2 GHz CPU. On a single GPU, training the pre-trained multi-modal network model (RIVA model) takes about 32 hours, and the best results are reached at around 35 training epochs. Table 2 below shows the hyper-parameter values of the pre-trained multi-modal network model (RIVA model) and the biLSTM-CRF model.
Table 2 Hyper-parameter settings for the RIVA and biLSTM-CRF models
2. Performance testing of image-text relationship classification
Table 3 below shows the performance of the text-image Relationship Gated Network (RGN) on text-image relationship classification on the Bloomberg text-image relationship dataset. In terms of network architecture, Lu et al. represent multimodal features as the concatenation of linguistic and visual features, whereas the text-image Relationship Gated Network (RGN) of the present invention uses element-wise multiplication. The advantage of element-wise multiplication is that the parameter gradients in one modality are more strongly influenced by the data of the other modality, which enables collaborative learning over multimodal data. The F1 score of the text-image Relationship Gated Network (RGN) trained on the Bloomberg text-image relationship dataset is 4.7% higher than that of the method proposed by Lu et al. in 2018. After adding the data in the pseudo-labeled100K dataset, the performance of the text-image Relationship Gated Network (RGN) improved by a further 1.1%.
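The design choice can be illustrated with a one-dimensional sketch (illustrative only, not the patent's implementation): for element-wise multiplication, the gradient of the fused feature with respect to the text feature is the visual feature itself, so each modality's updates are scaled by the other; for concatenation, that gradient is a constant 1 regardless of the visual input.

```python
def fuse_multiply(t, v):
    """Element-wise (here scalar) multiplicative fusion of
    a text feature t and a visual feature v."""
    return t * v

def grad_wrt_text_multiply(t, v):
    # d(t*v)/dt = v: the text-side gradient is scaled by the
    # visual feature, coupling the two modalities.
    return v

def grad_wrt_text_concat(t, v):
    # For concatenation [t, v], d/dt of the t-component is 1,
    # independent of the visual feature.
    return 1.0

g_mul = grad_wrt_text_multiply(0.5, 2.0)  # depends on v
g_cat = grad_wrt_text_concat(0.5, 2.0)    # does not
```

This is why multiplicative fusion lets visual data shape the language-side parameters (and vice versa), while concatenation keeps the two gradient paths independent.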
Table 3 F1 scores of the text-image relationship network on the Bloomberg dataset
3. Results of pre-training multi-modal network model (RIVA model) performance tests
Table 4 below shows the performance improvement of the pre-trained multi-modal network model (RIVA model) over the biLSTM-CRF model. "biLSTM-CRF (text)" performs the NER task on the word embedding sequence alone. "biLSTM-CRF (text + picture)" adds a visual feature at the beginning of the embedding sequence to inform the biLSTM-CRF model of the image content. "biLSTM-CRF (text) + RIVA (text + picture)" means the text-image pair is input to the pre-trained multi-modal network model (RIVA model); "biLSTM-CRF (text) + RIVA (text)" means text is the only input to the pre-trained multi-modal network model (RIVA model), i.e. the text-image Relationship Gated Network (RGN) and the attention-directed Visual Context Network (VCN) are removed.
Table 4 F1 scores of biLSTM-CRF with and without the RIVA model
The results show that "biLSTM-CRF (text) + RIVA (text + picture)" improves over "biLSTM-CRF (text)" by 1.8% and 2.2% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. Regarding the effect of visual features, "biLSTM-CRF (text + picture)" improves the F1 score by 0.35% on average over "biLSTM-CRF (text)", while "biLSTM-CRF (text) + RIVA (text + picture)" improves by 1.45% on average over "biLSTM-CRF (text) + RIVA (text)". This indicates that the RIVA model better exploits visual features to enrich the contextual information of the tweet. In Table 5 below, "-" denotes no test result; the three visual language models, i.e. the pre-trained multi-modal network model (RIVA model), the ACN model and the VAM model, are compared in combination with the biLSTM-CRF model. The results show that performance is best when the pre-trained multi-modal network model (RIVA model) is attached.
Table 5 Comparison of F1 scores of biLSTM-CRF combined with RIVA, ACN and VAM
Visual language model | Fudan University multi-modal tweet dataset | Snap Research MNER Twitter dataset |
biLSTM-CRF+ACN | 70.7 | - |
biLSTM-CRF+VAM | - | 80.7 |
biLSTM-CRF+RIVA | 71.5 | 82.3 |
4. RGN ablation study
To test the role of the RGN network, the text-image Relationship Gated Network (RGN) was removed from the pre-trained multi-modal network model (RIVA model), and the output of the attention-directed Visual Context Network (VCN) was passed directly to the input of the biLSTM network of the Visual Language Context Network (VLCN), i.e. the correlation score s_G between text and image was fixed to 1.
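A sketch of what this ablation changes (names illustrative): the RGN score s_G scales the visual context before it enters the language model, so removing the RGN is equivalent to fixing s_G = 1 and letting text-unrelated images pass through at full strength.

```python
def gated_visual_input(s_g, visual_context):
    """Scale the visual context vector by the RGN relevance score s_g."""
    return [s_g * x for x in visual_context]

v_c = [0.4, -0.2, 0.8]
gated = gated_visual_input(0.1, v_c)    # image judged unrelated: damped
ablated = gated_visual_input(1.0, v_c)  # RGN removed: passed unchanged
```

The ablation results below reflect exactly this difference: performance on text-unrelated images degrades when the damping is removed.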
Table 6 below shows that after removing the RGN, the overall F1 score of the RIVA model drops by 0.7% and 0.9% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. In addition, the test data was divided into two groups, "image adds information" and "image adds no information", to compare the effect of removing the RGN on the different text-image relationship types. The performance on the "image adds information" group barely changed, but the performance on the "image adds no information" group dropped: specifically, the F1 score fell by 1.2% on the Fudan University multi-modal tweet dataset and by 1.5% on the Snap Research MNER Twitter dataset. This demonstrates that text-unrelated visual features can negatively impact the learning of visual language representations.
Table 6 F1 scores of the RIVA network before and after RGN removal
Experimental conclusions:
The invention addresses the visual feature attention problem that arises in multimodal learning on tweets when the image is unrelated to the text. The invention provides a multi-modal tweet named entity recognition method based on text-picture relationship pre-training, and identifies and mitigates the problem that text-unrelated visual features can bring negative returns in multimodal NER. Experiments show that, using a teacher-student semi-supervised training method under the multi-task framework of text-image relationship classification and next word prediction, the pre-trained multi-modal network model (RIVA model) performs the text-image relationship classification task well. The results show that when fusing multimodal information, the RIVA model outperforms visual language models such as ACN and VAM.
Claims (9)
1. A multi-modal tweet named entity recognition method based on text-picture relation pre-training is characterized by comprising the following steps of:
step 1, large-scale data collection: using the twitter100k dataset as an unlabeled multimodal corpus; merging the image-text relationships in the Bloomberg text-image relationship dataset into a text-image related class and a text-image unrelated class, and dividing the Bloomberg text-image relationship dataset into a training set and a test set at a fixed ratio; selecting the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset as the data base;
step 2, establishing a pre-trained multi-modal network model for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises the following steps: a text-image relational gating network, an attention-directed visual context network, and a visual language context network;
step 3, pre-training a task;
and step 4, applying the pre-trained multi-modal network model to the multi-modal NER task: testing the pre-trained multi-modal model using the biLSTM-CRF model as the reference model for named entity recognition; inputting the word embedding e_k into the biLSTM network, the conditional random field using the biLSTM hidden vector h_t of each word embedding e_k to tag the sequence with entity labels; using the pre-trained multi-modal network model, after the text-image pair is input, concatenating the hidden outputs of the forward LSTM network and the backward LSTM network in the visual language context network at each position into a visual language context embedding; and, when performing the multi-modal NER task, replacing the word embedding e_k with this visual language context embedding.
2. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the step 2 specifically comprises the following steps:
step 2.1, establishing a text-image relation gating network: completing text-image relation classification by using a full connection layer based on language and visual feature fusion; learning language features of the tweet from the biLSTM network;
step 2.1.1, inputting the concatenation of the word and the word's characters jointly into the biLSTM network, and concatenating the forward and backward outputs of the biLSTM network as the encoded text vector f_t ∈ R^(1×d_t), where d_t is the dimension of the text vector f_t and 1×d_t is the size of the vector space to which the text vector f_t belongs;
step 2.1.2, extracting the visual feature f_v from the image using ResNet; based on the output size of the last convolutional layer in ResNet, applying average pooling over fixed regions and representing the whole image as a fixed-dimension vector f_v;
step 2.1.3, finally, computing the element-wise product of the encoded text vector and the encoded image vector, f_t ⊙ f_v, and inputting it into the FC layer and the softmax layer to obtain the binary classification and the visual context gating score s_G;
Step 2.2, establishing an attention-oriented visual context network;
step 2.2.1, letting V_r = {r_ij} be the regional visual features of a given image, where i = 1,...,m, j = 1,...,n, r_ij ∈ R^(d_v) is a regional feature, d_v is its dimension, m×n×d_v is the output size of the last convolutional layer in ResNet, and m×n is the number of regions in the image;
step 2.2.2, capturing the local visual features related to the language context using scaled dot-product attention, which is defined as:

Attention(Q, K, V) = softmax(QK^T / √d_k) V      (1)

in the above formula, the matrices Q, K and V represent the query, key and value, respectively; d_k is the dimension of the key;
step 2.2.3, using the language query vector Q_s = f_t as the query and the regional visual features V_r as the keys and values; converting the language query vector Q_s and the regional visual features V_r to the same dimension by linear projection, obtaining Q_s' and V_r';
step 2.2.4, computing the language-guided attention head_i = Attention(Q_s', V_r', V_r'), where Q_s is the language query vector, Q_s' is the dimension-converted language query vector and V_r' is the dimension-converted regional visual features; and extending the single-head attention to multi-head attention; the output of the local visual context V_c is defined as:

head_i = Attention(Q_s', V_r', V_r')      (2)

V_c = Concat(head_1, ..., head_h)      (3)

in the above formulas (2) to (3), Q_s is the language query vector, Q_s' is the dimension-converted language query vector, V_r' is the dimension-converted regional visual features, V_c is the local visual context, head_i is the output of the i-th local visual context head, i = 1,...,h, and h is the total number of local visual context outputs;
step 2.3, establishing a visual language context network, and learning visual language context embedding on a twitter100k data set by using a biLSTM network;
step 2.3.1, first, giving the visual vector v = s_G ⊙ V_c and a sequence {w_t}, t = 1,...,T, of length T, where s_G is the visual context gating score, V_c is the local visual context, and T is the length of the sequence {w_t};
step 2.3.2, using the forward LSTM network to predict w_t from (w_1,...,w_{t-1}), where at time t = 0 the forward input is the visual vector v; and using the backward LSTM network to predict w_t from (w_{t+1},...,w_T), where at time T + 1 the backward input is the visual vector v;
step 2.3.3, adding the word embedding [BOS] at the start of the word sequence to indicate the beginning and the word embedding [EOS] at the end to indicate the end, the sequence being expressed as ([BOS], w_1, ..., w_T, [EOS]); substituting the visual features for [BOS] in forward prediction and for [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input to the LSTM network.
3. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the step 3 specifically comprises the following steps:
step 3.1, classifying the text-image relation;
step 3.1.1, carrying out text-image relationship classification using the Bloomberg text-image relationship dataset to determine whether the content of the image provides effective information beyond the text;
step 3.1.2, enhancing the text-image relationship classification task with a teacher-student semi-supervised scheme: firstly, training the teacher model on the Bloomberg text-image relationship dataset; then, predicting the twitter100k dataset with the teacher model, selecting the tweets with higher scores in the text-image related category, and constructing new pseudo-label training data, the pseudo-label training data forming the pseudo-labeled100K dataset; finally, in the training of the pre-trained multi-modal network model, taking the text-image relationship gating network as the student model, firstly training the student model with the pseudo-label training data, and reducing noisy-label errors by fine-tuning on the Bloomberg text-image relationship dataset;
step 3.1.3, letting x_i = <text_i, image_i> be a text-image tweet pair, computing the binary relation classification losses of the data in the Bloomberg text-image relationship dataset and the pseudo-labeled100K dataset by cross entropy:

L_Bloomberg = - Σ_{x_i ∈ Bloomberg} log p(x_i)      (4)

L_pseudo-labeled100K = - Σ_{x_i ∈ pseudo-labeled100K} log p(x_i)      (5)
in the above formulas (4) to (5), L_Bloomberg is the binary relation classification loss on the Bloomberg text-image relationship dataset, Bloomberg denotes the Bloomberg text-image relationship dataset, L_pseudo-labeled100K is the binary relation classification loss on the pseudo-labeled100K dataset, and pseudo-labeled100K denotes the pseudo-labeled100K dataset; x_i is an image-text pair, and p(x_i) is the probability of the correct class, computed by the softmax layer;
and step 3.2, next word prediction: the visual language context network computes the probability of a sequence by modeling the probability of the next word in the forward and backward directions, the probability of the sentence being:

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_1,...,w_{t-1})      (forward)

p(w_1,...,w_T | v) = Π_{t=1}^{T} p(w_t | v, w_{t+1},...,w_T)      (backward)
in the above formulas, the probability p(w_t | ·) is computed by the LSTM network followed by the FC layer and the softmax layer, and the cross-entropy loss of the predicted words is computed from it; the target loss in the forward and backward directions is minimized:

L_LM = - Σ_{t=1}^{T} ( log p(w_t | v, w_1,...,w_{t-1}) + log p(w_t | v, w_{t+1},...,w_T) )
4. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: in step 1, the ratio for dividing the Bloomberg text-image relationship dataset into the training set and the test set is 8:2.
5. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: in step 2, the text-image relationship gating network performs classification based on the text-image relationship and outputs a correlation score between the text and the image, the correlation score serving as a gating control on the path from the attention-directed visual context network to the visual language context network; the attention-directed visual context network is a network based on visual-text attention for extracting the local visual information related to the text, and its output is the visual context, which serves as an input to the LSTM network to guide the learning of the visual language context network; and the visual language context network is a visual language model for performing the next word prediction task.
6. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 2, wherein: the fully connected layer based on the fusion of language and visual features in step 2.1 fuses the language and visual features by element-wise multiplication.
7. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 2, wherein: the FC layer in step 2.1.3 is a linear neural network.
8. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 1, wherein: the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and conditional random fields.
9. The method of multi-modal tweet named entity recognition pre-trained based on text-to-picture relationships of claim 3, wherein: the teacher model in step 3.1.2 is an independent network, the structure of which is the same as that of the text-image relationship gating network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116968.2A CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011116968.2A CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112257445A true CN112257445A (en) | 2021-01-22 |
CN112257445B CN112257445B (en) | 2024-01-26 |
Family
ID=74244224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011116968.2A Active CN112257445B (en) | 2020-10-19 | 2020-10-19 | Multi-mode push text named entity recognition method based on text-picture relation pre-training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112257445B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112800259A (en) * | 2021-04-07 | 2021-05-14 | 武汉市真意境文化科技有限公司 | Image generation method and system based on edge closure and commonality detection |
CN113158875A (en) * | 2021-04-16 | 2021-07-23 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interactive fusion network |
CN113627172A (en) * | 2021-07-26 | 2021-11-09 | 重庆邮电大学 | Entity identification method and system based on multi-granularity feature fusion and uncertain denoising |
CN113657115A (en) * | 2021-07-21 | 2021-11-16 | 内蒙古工业大学 | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion |
CN113704502A (en) * | 2021-08-27 | 2021-11-26 | 电子科技大学 | Multi-mode information fusion account position identification method in social media |
CN113704547A (en) * | 2021-08-26 | 2021-11-26 | 合肥工业大学 | Multi-mode label recommendation method based on one-way supervision attention |
CN113806564A (en) * | 2021-09-22 | 2021-12-17 | 齐鲁工业大学 | Multi-mode informativeness tweet detection method and system |
CN114549850A (en) * | 2022-01-24 | 2022-05-27 | 西北大学 | Multi-modal image aesthetic quality evaluation method for solving modal loss problem |
CN114782739A (en) * | 2022-03-31 | 2022-07-22 | 电子科技大学 | Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer |
CN115080766A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
JP2022141587A (en) * | 2021-03-15 | 2022-09-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method and apparatus for acquiring pretraining model |
CN116341555A (en) * | 2023-05-26 | 2023-06-27 | 华东交通大学 | Named entity recognition method and system |
CN116842141A (en) * | 2023-08-28 | 2023-10-03 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
CN116561326B (en) * | 2023-07-10 | 2023-10-13 | 中国传媒大学 | Image text event extraction method, system and equipment based on label enhancement |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | A kind of image header automatic generation method based on multi-modal attention |
US20180365562A1 (en) * | 2017-06-20 | 2018-12-20 | Battelle Memorial Institute | Prediction of social media postings as trusted news or as types of suspicious news |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN111046668A (en) * | 2019-12-04 | 2020-04-21 | 北京信息科技大学 | Method and device for recognizing named entities of multi-modal cultural relic data |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
- 2020-10-19 CN CN202011116968.2A patent/CN112257445B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760507A (en) * | 2016-02-23 | 2016-07-13 | 复旦大学 | Cross-modal subject correlation modeling method based on deep learning |
US20180365562A1 (en) * | 2017-06-20 | 2018-12-20 | Battelle Memorial Institute | Prediction of social media postings as trusted news or as types of suspicious news |
CN108829677A (en) * | 2018-06-05 | 2018-11-16 | 大连理工大学 | A kind of image header automatic generation method based on multi-modal attention |
CN110111399A (en) * | 2019-04-24 | 2019-08-09 | 上海理工大学 | A kind of image text generation method of view-based access control model attention |
CN111046668A (en) * | 2019-12-04 | 2020-04-21 | 北京信息科技大学 | Method and device for recognizing named entities of multi-modal cultural relic data |
CN111444721A (en) * | 2020-05-27 | 2020-07-24 | 南京大学 | Chinese text key information extraction method based on pre-training language model |
Non-Patent Citations (1)
Title |
---|
谢腾;杨俊安;刘辉: "Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model" (in Chinese), 计算机***应用, no. 07 *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2022141587A (en) * | 2021-03-15 | 2022-09-29 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | Method and apparatus for acquiring pretraining model |
CN112800259A (en) * | 2021-04-07 | 2021-05-14 | 武汉市真意境文化科技有限公司 | Image generation method and system based on edge closure and commonality detection |
CN113158875B (en) * | 2021-04-16 | 2022-07-01 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interaction fusion network |
CN113158875A (en) * | 2021-04-16 | 2021-07-23 | 重庆邮电大学 | Image-text emotion analysis method and system based on multi-mode interactive fusion network |
CN113657115A (en) * | 2021-07-21 | 2021-11-16 | 内蒙古工业大学 | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion |
CN113657115B (en) * | 2021-07-21 | 2023-06-30 | 内蒙古工业大学 | Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion |
CN113627172A (en) * | 2021-07-26 | 2021-11-09 | 重庆邮电大学 | Entity identification method and system based on multi-granularity feature fusion and uncertain denoising |
CN113704547A (en) * | 2021-08-26 | 2021-11-26 | 合肥工业大学 | Multi-mode label recommendation method based on one-way supervision attention |
CN113704547B (en) * | 2021-08-26 | 2024-02-13 | 合肥工业大学 | Multimode tag recommendation method based on unidirectional supervision attention |
CN113704502B (en) * | 2021-08-27 | 2023-04-21 | 电子科技大学 | Multi-mode information fusion account number position identification method based on social media |
CN113704502A (en) * | 2021-08-27 | 2021-11-26 | 电子科技大学 | Multi-mode information fusion account position identification method in social media |
CN113806564B (en) * | 2021-09-22 | 2024-05-10 | 齐鲁工业大学 | Multi-mode informative text detection method and system |
CN113806564A (en) * | 2021-09-22 | 2021-12-17 | 齐鲁工业大学 | Multi-mode informativeness tweet detection method and system |
CN114549850B (en) * | 2022-01-24 | 2023-08-08 | 西北大学 | Multi-mode image aesthetic quality evaluation method for solving modal missing problem |
CN114549850A (en) * | 2022-01-24 | 2022-05-27 | 西北大学 | Multi-modal image aesthetic quality evaluation method for solving modal loss problem |
CN114782739A (en) * | 2022-03-31 | 2022-07-22 | 电子科技大学 | Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer |
CN115080766B (en) * | 2022-08-16 | 2022-12-06 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
CN115080766A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Multi-modal knowledge graph characterization system and method based on pre-training model |
CN116341555A (en) * | 2023-05-26 | 2023-06-27 | 华东交通大学 | Named entity recognition method and system |
CN116341555B (en) * | 2023-05-26 | 2023-08-04 | 华东交通大学 | Named entity recognition method and system |
CN116561326B (en) * | 2023-07-10 | 2023-10-13 | 中国传媒大学 | Image text event extraction method, system and equipment based on label enhancement |
CN116842141A (en) * | 2023-08-28 | 2023-10-03 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
CN116842141B (en) * | 2023-08-28 | 2023-11-07 | 北京中安科技发展有限公司 | Alarm smoke linkage based digital information studying and judging method |
Also Published As
Publication number | Publication date |
---|---|
CN112257445B (en) | 2024-01-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112257445A (en) | Multi-modal tweet named entity recognition method based on text-picture relation pre-training | |
Nooralahzadeh et al. | Progressive transformer-based generation of radiology reports | |
CN104615608B (en) | A kind of data mining processing system and method | |
CN109635280A (en) | A kind of event extraction method based on mark | |
CN112183094B (en) | Chinese grammar debugging method and system based on multiple text features | |
CN111444367B (en) | Image title generation method based on global and local attention mechanism | |
CN109255027B (en) | E-commerce comment sentiment analysis noise reduction method and device | |
Thirumoorthy et al. | Feature selection using hybrid poor and rich optimization algorithm for text classification | |
CN109189862A (en) | A kind of construction of knowledge base method towards scientific and technological information analysis | |
CN110597979A (en) | Self-attention-based generating text summarization method | |
CN112749274B (en) | Chinese text classification method based on attention mechanism and interference word deletion | |
CN108256968A (en) | A kind of electric business platform commodity comment of experts generation method | |
CN111353306A (en) | Entity relationship and dependency Tree-LSTM-based combined event extraction method | |
CN110516098A (en) | Image labeling method based on convolutional neural networks and binary coding feature | |
CN112989806A (en) | Intelligent text error correction model training method | |
CN110909116A (en) | Entity set expansion method and system for social media | |
CN116127056A (en) | Medical dialogue abstracting method with multi-level characteristic enhancement | |
Liu et al. | UAMNer: uncertainty-aware multimodal named entity recognition in social media posts | |
CN113407697A (en) | Chinese medical question classification system for deep encyclopedia learning | |
He et al. | Syntax-aware entity representations for neural relation extraction | |
CN115422939A (en) | Fine-grained commodity named entity identification method based on big data | |
Peng et al. | Relation-aggregated cross-graph correlation learning for fine-grained image–text retrieval | |
Zhao et al. | Aligned visual semantic scene graph for image captioning | |
CN116628173B (en) | Intelligent customer service information generation system and method based on keyword extraction | |
Bergamaschi et al. | Conditional random fields with semantic enhancement for named-entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |