CN112257445A - Multi-modal tweet named entity recognition method based on text-picture relation pre-training - Google Patents

Multi-modal tweet named entity recognition method based on text-picture relation pre-training

Info

Publication number
CN112257445A
Authority
CN
China
Prior art keywords
text
visual
network
image
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011116968.2A
Other languages
Chinese (zh)
Other versions
CN112257445B (en)
Inventor
翁芳胜
孙霖
王跻权
孙宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN202011116968.2A priority Critical patent/CN112257445B/en
Publication of CN112257445A publication Critical patent/CN112257445A/en
Application granted granted Critical
Publication of CN112257445B publication Critical patent/CN112257445B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a multi-modal tweet named entity recognition method based on text-picture relation pre-training, which comprises the following steps: step 1, large-scale data collection; step 2, establishing a pre-trained multi-modal network model (RIVA model) for relationship inference and visual attention; and step 3, pre-training tasks. The invention has the beneficial effects that: the invention uses relationship inference and visual attention to mitigate the negative impact that arises when a multi-modal model fuses unmatched visual and textual information, so that multi-modal information is fused more effectively. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large-scale unlabeled tweet data that can be collected in bulk, generating a pseudo-labeled dataset, and then fine-tunes on a small manually labeled dataset, which expands the data while improving the performance of the text-image classification network.

Description

Multi-modal tweet named entity recognition method based on text-picture relation pre-training
Technical Field
The invention belongs to the field of tweet named entity recognition, and mainly relates to a pre-trained multi-modal network (RIVA) based on relationship inference and visual attention, in which text-image relationship classification is performed on a large unlabeled multi-modal corpus using a teacher-student semi-supervised paradigm.
Background
Social media such as Twitter have become part of many people's daily lives. They are an important data source for various applications such as open-domain event extraction and social knowledge graphs, and named entity recognition of tweets is the first step of these tasks. Named Entity Recognition (NER) has achieved excellent performance on news articles. However, named entity recognition results on tweets are still unsatisfactory, because tweets are short and provide insufficient context for reasoning.
To overcome this problem, researchers have recently observed, from a multi-modal perspective, that visual information is inherently related to linguistic information. They therefore attempt to enhance the contextual information of the text, and thereby obtain better reasoning results, by using an attention mechanism to correlate visual and textual information. Zhang et al., in "Adaptive co-attention network for named entity recognition in tweets" (Thirty-Second AAAI Conference on Artificial Intelligence), designed an adaptive co-attention network layer that learns fused visual and linguistic features through a gated multimodal fusion module; they also released a multimodal tweet dataset, which we call the Fudan University multi-modal tweet dataset. The visual language model of Zhang et al. is abbreviated ACN; ACN uses a filter gate to judge whether the fused features help improve the labeling precision of each feature. Lu et al., in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, proposed a visual attention model for finding image regions related to the text content, and also released a multimodal tweet dataset, which we call the Snap Research MNER Twitter dataset. The visual language model of Lu et al. is abbreviated VAM; it computes attention weights over image regions from linear projections of the text query vector and the regional visual representations, and the authors give a series of visual attention examples. In a successful visual attention example, the entities in the text appear correspondingly in the image; in a failed example, the objects in the picture have no relation to the entities in the text. Most previous visual language model work is based on the assumption that images and texts are correlated, and ignores the situation where the text may be unrelated to the image. Vempala et al., in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics ("Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts"), classified a tweet dataset according to whether the image augments the meaning of the tweet; they concluded that the text-image-unrelated type accounts for approximately 56% of all text-image pairs. Hu et al. proposed Twitter100k ("Twitter100k: A real-world dataset for weakly supervised cross-media retrieval", IEEE Transactions on Multimedia, 2017), and our tests on this large unlabeled corpus found that the proportion of unrelated text-image pairs can reach 60%, similar to the result of Vempala et al. This confirms that the text and images in tweets are not always related; forcing the association of unrelated text-image pairs may introduce erroneous information and reduce the performance of a visual language model. Therefore, previous multi-modal fusion methods do not adequately address the negative effects that occur when text encounters irrelevant visual cues.
In summary, it is particularly important to provide a method for multi-modal tweet named entity recognition based on text-picture relationship pre-training.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a method for recognizing a multi-modal tweet named entity based on text-picture relation pre-training.
The method for recognizing the multi-modal tweet named entity based on text-picture relation pre-training comprises the following steps:
step 1, large-scale data collection: using the twitter100k dataset as a large unlabeled multimodal corpus; merging the image-text relations in the Bloomberg text-image relation dataset into a text-image related relation and a text-image unrelated relation, and dividing the Bloomberg text-image relation dataset into a training set and a test set according to a fixed proportion; selecting the multi-modal tweet dataset of Fudan University and the MNER Twitter dataset of Snap Research as the data base;
step 2, establishing a pre-trained multi-modal network model (RIVA model) for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises: a text-image Relation Gating Network (RGN), an attention-guided Visual Context Network (VCN), and a Visual Language Context Network (VLCN);
step 3, pre-training tasks;
and step 4, applying the pre-trained multi-modal network model (RIVA model) to the multi-modal NER task: testing the pre-trained multi-modal model using the biLSTM-CRF model as the reference model for named entity recognition; the word embedding $e_k$ is input into the biLSTM network, and the Conditional Random Field (CRF) uses the biLSTM hidden vector $h_t$ of each word embedding $e_k$ to tag the sequence with entity labels; using the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks in the Visual Language Context Network (VLCN) at each embedding are concatenated into a visual-language context embedding $e_k^{VLC}$; when performing the multi-modal NER task, the word embedding $e_k$ is replaced with $e_k^{VLC}$.
Preferably, step 2 specifically comprises the following steps:
step 2.1, establishing the text-image Relation Gating Network (RGN): text-image relation classification is completed by a fully connected layer based on the fusion of language and visual features; the language features of the tweet are learned by a biLSTM (bidirectional LSTM) network;
step 2.1.1, the word embedding and the character embedding of each word are concatenated and input into the biLSTM network, and the forward output and backward output of the biLSTM network are concatenated as the encoded text vector $f_t \in \mathbb{R}^{1 \times d_t}$, where $d_t$ is the dimension of the text vector $f_t$ and $1 \times d_t$ is the size of the vector space to which $f_t$ belongs;
step 2.1.2, the visual feature $f_v$ is extracted from the image using ResNet; based on the output size of the last convolutional layer in ResNet, average pooling is applied over a fixed region and the whole image is represented as a fixed-dimension vector $f_v$;
step 2.1.3, finally, the encoded text vector and the encoded image vector are multiplied element-wise, $f_t \odot f_v$, and the result is input into the FC layer and the softmax layer to obtain the binary classification score and the visual context gate $s_G$;
step 2.2, establishing the attention-guided Visual Context Network (VCN);
step 2.2.1, let $V_r = \{r_{ij}\}$ be the regional visual features of a given image, where $i = 1, \dots, m$, $j = 1, \dots, n$, and $r_{ij} \in \mathbb{R}^{d_v}$; $r_{ij}$ is a regional feature, $d_v$ is its dimension, $m \times n \times d_v$ is the output size of the last convolutional layer in ResNet, and $m \times n$ is the number of regions in the image;
step 2.2.2, local visual features related to the language context are captured using scaled dot-product attention, which is defined as:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$
in the above formula, the matrices $Q$, $K$ and $V$ represent the query, key and value, respectively, and $d_k$ is the dimension of the key;
step 2.2.3, the language query vector $Q_s = f_t$ is used as the query and the regional visual features $V_r$ are used as the keys and values; the language query vector $Q_s$ and the regional visual features $V_r$ are converted to the same dimension by linear projection, giving $\tilde{Q}_s$ and $\tilde{V}_r$;
step 2.2.4, the language attention $A_s = \mathrm{Attention}(\tilde{Q}_s, \tilde{V}_r, \tilde{V}_r)$ is calculated, where $\tilde{Q}_s$ is the dimension-converted language query vector $Q_s$ and $\tilde{V}_r$ is the dimension-converted regional visual features $V_r$; single-head attention is extended to multi-head attention, and the output of the local visual context $V_c$ is defined as:
$V_c = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O} \quad (2)$
$\mathrm{head}_i = \mathrm{Attention}(\tilde{Q}_s W_i^{Q}, \tilde{V}_r W_i^{K}, \tilde{V}_r W_i^{V}) \quad (3)$
in formulas (2) to (3), $Q_s$ is the language query vector, $\tilde{Q}_s$ is the dimension-converted language query vector, $\tilde{V}_r$ is the dimension-converted regional visual features, $V_c$ is the local visual context, $\mathrm{head}_i$ ($i = 1, \dots, h$) is the output of the $i$-th attention head, and $h$ is the total number of heads;
step 2.3, establishing the Visual Language Context Network (VLCN), and learning visual-language context embeddings on the twitter100k dataset using a biLSTM network;
step 2.3.1, first, the visual vector $v_G = s_G \cdot V_c$ is given, together with a word sequence $\{w_t\}$, $t = 1, \dots, T$, of length $T$, where $s_G$ is the visual context gating score, $V_c$ is the local visual context, and $T$ is the length of the sequence $\{w_t\}$;
step 2.3.2, a forward LSTM network predicts the word $w_t$ from $(w_1, \dots, w_{t-1})$, and at time $t = 0$ the forward input is the visual vector $v_G$; meanwhile, a backward LSTM network predicts the word $w_t$ from $(w_{t+1}, \dots, w_T)$, and at time $T + 1$ the backward input is the visual vector $v_G$;
step 2.3.3, to align the word embeddings in the forward and backward directions, the word embedding [BOS] is added to the word sequence to indicate the start and the word embedding [EOS] is added to indicate the end, so the sequence is expressed as ([BOS], $w_1, \dots, w_T$, [EOS]); the visual feature replaces [BOS] in forward prediction and replaces [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network.
Preferably, step 3 specifically comprises the following steps:
step 3.1, text-image relation classification;
step 3.1.1, text-image relation classification is performed using the Bloomberg text-image relation dataset to determine whether the content of the image provides effective information beyond the text;
step 3.1.2, the text-image relation classification task is enhanced using a teacher-student semi-supervised paradigm: first, a teacher model is trained on the Bloomberg text-image relation dataset; then, the teacher model is used to predict the twitter100k dataset, tweets with higher scores in the text-image related category are selected, and new pseudo-labeled training data are constructed, the pseudo-labeled training data forming the pseudo-labeled 100K dataset; finally, in the training of the pre-trained multi-modal network model (RIVA model), the text-image Relation Gating Network (RGN) is taken as the student model, which is first trained with the pseudo-labeled training data and then fine-tuned on the Bloomberg text-image relation dataset to reduce noisy labeling errors;
step 3.1.3, let $x_i = \langle \text{text}_i, \text{image}_i \rangle$ be a text-image tweet pair; the binary relation classification losses on the Bloomberg text-image relation dataset (Bloomberg) and the pseudo-labeled 100K dataset (pseudo-labeled 100K) are computed by cross entropy:
$\mathcal{L}_{\mathrm{Bloomberg}} = -\sum_{x_i \in \mathrm{Bloomberg}} \log p(x_i) \quad (4)$
$\mathcal{L}_{\mathrm{pseudo}} = -\sum_{x_i \in \mathrm{pseudo\text{-}labeled\,100K}} \log p(x_i) \quad (5)$
in formulas (4) to (5), $\mathcal{L}_{\mathrm{Bloomberg}}$ is the binary relation classification loss on the Bloomberg text-image relation dataset, Bloomberg denotes the Bloomberg text-image relation dataset, $\mathcal{L}_{\mathrm{pseudo}}$ is the binary relation classification loss on the pseudo-labeled 100K dataset, pseudo-labeled 100K denotes the pseudo-labeled 100K dataset, $x_i$ is an image-text pair, and $p(x_i)$ is the probability of correct classification, calculated by the softmax layer;
step 3.2, next word prediction: the Visual Language Context Network (VLCN) computes the probability of a sequence by modeling the probability of the next word in the forward and backward directions; the probability of the sentence is:
$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid v_G, w_1, \dots, w_{t-1}) = \prod_{t=1}^{T} p(w_t \mid w_{t+1}, \dots, w_T, v_G) \quad (6)$
in the above formula, the probability $p(w_t \mid \cdot)$ is calculated by the LSTM network followed by the FC layer and the softmax layer, and the cross-entropy loss of the predicted word is computed from it; the target loss in the forward and backward directions is minimized:
$\mathcal{L}_{\mathrm{NWP}} = -\sum_{t=1}^{T} \left[ \log p(w_t \mid v_G, w_1, \dots, w_{t-1}) + \log p(w_t \mid w_{t+1}, \dots, w_T, v_G) \right] \quad (7)$
in the above formula, $\mathcal{L}_{\mathrm{NWP}}$ is the target loss in the forward and backward directions, $\{w_t\}$, $t = 1, \dots, T$, is the word sequence, $T$ is its length, and $v_G = s_G \cdot V_c$ is the visual vector.
Preferably, in step 1, the Bloomberg text-image relation dataset is divided into a training set and a test set at a ratio of 8:2.
Preferably, in step 2, the text-image Relation Gating Network (RGN) performs classification based on the text-image relation and outputs a correlation score $s_G$ between the text and the image, which is used as a gating control on the path from the attention-guided Visual Context Network (VCN) to the Visual Language Context Network (VLCN); the attention-guided Visual Context Network (VCN) is a network based on visual-text attention for extracting local visual information related to the text, and its output is a visual context that is used as an input of the LSTM network to guide the learning of the Visual Language Context Network (VLCN); the Visual Language Context Network (VLCN) is a visual language model for performing the next word prediction (NWP) task.
Preferably, the fully connected layer based on the fusion of language and visual features in step 2.1 adopts element-wise multiplication for the fusion of the language and visual features.
Preferably, the FC layer in step 2.1.3 is a linear neural network.
Preferably, the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and Conditional Random Fields (CRFs).
Preferably, the teacher model in step 3.1.2 is an independent network whose structure is the same as that of the text-image Relation Gating Network (RGN).
The invention has the beneficial effects that: the invention uses relationship inference and visual attention to mitigate the negative impact that arises when a multi-modal model fuses unmatched visual and textual information, so that multi-modal information is fused more effectively. The invention uses a teacher-student semi-supervised learning method to perform image-text relationship pre-training on large-scale unlabeled tweet data that can be collected in bulk, generating a pseudo-labeled dataset, and then fine-tunes on a small manually labeled dataset, which expands the data while improving the performance of the text-image classification network.
Drawings
FIG. 1 is a diagram of a visual attention example of a VAM model;
FIG. 2 is a schematic diagram of the neural network structure of RIVA.
Detailed Description
The present invention will be further described with reference to the following examples. The following examples are set forth merely to aid in the understanding of the invention. It should be noted that a person skilled in the art can make several modifications to the invention without departing from its principle, and such modifications and improvements also fall within the protection scope of the claims of the present invention.
In multi-modal named entity recognition, the main difficulty is how to merge into the text the visual information that benefits the text while excluding invalid information, so as to generate high-quality multi-modal information. Centered on the image-text relationship, the invention starts from the two aspects of relationship inference and visual attention and uses large-scale unsupervised data for semi-supervised learning to complete the multi-modal named entity recognition task.
As shown in Fig. 1, the text corresponding to Fig. 1(a) is "[PER Radiohead (band name)] performs new and old songs in its first concert in four years", and shows a successful visual attention example: the entities in the text appear correspondingly in the image. Fig. 1(b) corresponds to a text mentioning [PER Kevin Love] and [PER Kyle Korver] of the [ORG Cleveland Cavaliers]; its highlighted upper half shows a failed visual attention example, in which the objects in the picture have no relation to the entities in the text.
Example 1:
a multi-modal tweet named entity recognition method based on text-picture relation pre-training comprises the following steps:
1. large-scale unlabeled and labeled dataset collection
1) Twitter100k dataset: this dataset was proposed by Hu et al. in 2017 and consists of 100,000 image-text pairs randomly captured from the Twitter platform. Approximately 1/4 of the images in the Twitter100k dataset are highly correlated with their respective texts. Hu et al. studied weakly supervised learning for cross-media retrieval on this dataset. The invention uses this dataset as a large unlabeled multi-modal corpus, on which the image-text relationship matching and next word prediction tasks of the RIVA model are performed.
2) Bloomberg text-image relation dataset: the four image-text relations are merged into two relations, namely the text-image related relation and the text-image unrelated relation, i.e. a binary task between R1 ∪ R2 and R3 ∪ R4. The dataset is divided into a training set and a test set at a ratio of 8:2. The invention uses this dataset to train the teacher model in the semi-supervised learning of image-text relationship matching and to fine-tune the student model (see the sketch after this list).
3) The Fudan University multi-modal tweet dataset divides the named entity types into person, location, organization and others, annotates 8,257 tweets using the BIO2 labeling scheme, and assigns 4,000, 1,000, and 3,257 items to the training, validation, and test sets, respectively.
4) The Snap Research MNER Twitter dataset annotates 6,882 tweets using the BIO labeling scheme and assigns 4,817, 1,032, and 1,033 items to the training, validation, and test sets, respectively.
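For illustration, a minimal Python sketch of the dataset preparation for item 2) above is given below; the binary merge (R1 ∪ R2 related, R3 ∪ R4 unrelated) and the 8:2 split follow the text, while the sample field names ('text', 'image_path', 'relation') are illustrative assumptions rather than the dataset's actual schema.

```python
import random

def prepare_bloomberg(samples, seed=42):
    """Merge the four Bloomberg text-image relations into binary labels
    (R1/R2 -> related, R3/R4 -> unrelated) and split 8:2 into train/test.
    Each sample is assumed to be a dict with 'text', 'image_path' and a
    'relation' field in {'R1', 'R2', 'R3', 'R4'} (illustrative field names)."""
    related = {"R1", "R2"}
    labeled = [{**s, "label": 1 if s["relation"] in related else 0} for s in samples]
    random.Random(seed).shuffle(labeled)   # deterministic shuffle before splitting
    cut = int(0.8 * len(labeled))
    return labeled[:cut], labeled[cut:]    # training set, test set
```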
2. Designing a pre-trained multi-modal network model (RIVA model):
1) Text-image Relation Gating Network (RGN)
In the RGN, text-image relation classification is completed by a fully connected layer based on the fusion of language and visual features. The language features of tweets are learned from a bidirectional LSTM (biLSTM) network. The word embedding and the character embedding of each word are concatenated and input into the biLSTM network, and the forward output and backward output of the biLSTM network are concatenated as the encoded text vector $f_t \in \mathbb{R}^{1 \times d_t}$. The visual feature $f_v$ is extracted from the image using ResNet. The output size of the last convolutional layer in ResNet is $7 \times 7 \times d_v$; therefore, average pooling is applied over the $7 \times 7$ regions and the entire image is represented as a $d_v$-dimensional vector $f_v$ (when ResNet-34 is used, $d_v = 512$). Finally, the language and visual features are multiplied element-wise, $f_t \odot f_v$, and the result is input into the fully connected FC layer and the softmax layer to obtain the binary classification score and the visual context gate $s_G$.
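For illustration, a minimal PyTorch sketch of the relation gating computation described above is given below; the embedding and hidden sizes are illustrative, and the linear projection of the visual vector is an added assumption used only to make the element-wise product $f_t \odot f_v$ dimensionally consistent.

```python
import torch
import torch.nn as nn

class RelationGatingNetwork(nn.Module):
    """RGN sketch: biLSTM text encoding, element-wise fusion with the pooled
    ResNet visual vector, then FC + softmax to obtain the gate score s_G."""
    def __init__(self, emb_dim=400, hidden_dim=256, visual_dim=512):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # assumption: project f_v so that it matches the dimension of f_t
        self.visual_proj = nn.Linear(visual_dim, 2 * hidden_dim)
        self.fc = nn.Linear(2 * hidden_dim, 2)      # binary relation classes

    def forward(self, token_embs, visual_vec):
        # token_embs: (batch, seq_len, emb_dim) word + character embeddings
        # visual_vec: (batch, visual_dim) average-pooled ResNet feature f_v
        _, (h_n, _) = self.bilstm(token_embs)
        f_t = torch.cat([h_n[0], h_n[1]], dim=-1)   # concat forward/backward states
        f_v = self.visual_proj(visual_vec)
        fused = f_t * f_v                           # element-wise product f_t ⊙ f_v
        logits = self.fc(fused)
        s_g = torch.softmax(logits, dim=-1)[:, 1]   # "related" probability used as the gate
        return logits, s_g
```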
2) Attention-guided Visual Context Network (VCN)
The output size of the last convolutional layer in ResNet is $7 \times 7 \times d_v$, where $7 \times 7$ corresponds to 49 regions in the image. Let $V_r = \{r_{ij}\}$ be the regional visual features of a given image, where $i = 1, \dots, 7$, $j = 1, \dots, 7$, and $r_{ij} \in \mathbb{R}^{d_v}$. Local visual features related to the language context are captured using scaled dot-product attention, which is generally defined as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$
In the above formula, the matrices $Q$, $K$ and $V$ represent the query, key and value, respectively, and $d_k$ is the dimension of the key. The language query vector $Q_s = f_t$ is used as the query, and the regional visual features $V_r$ are used as the keys and values. $Q_s$ and $V_r$ are converted to the same dimension by linear projection, giving $\tilde{Q}_s$ and $\tilde{V}_r$. The language attention $A_s = \mathrm{Attention}(\tilde{Q}_s, \tilde{V}_r, \tilde{V}_r)$ is then calculated, and single-head attention is extended to multi-head attention. Finally, the output of the local visual context $V_c$ is defined as:
$V_c = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O} \quad (2)$
$\mathrm{head}_i = \mathrm{Attention}(\tilde{Q}_s W_i^{Q}, \tilde{V}_r W_i^{K}, \tilde{V}_r W_i^{V}) \quad (3)$
In formulas (2) to (3), $Q_s$ is the language query vector, $\tilde{Q}_s$ is the dimension-converted language query vector, $\tilde{V}_r$ is the dimension-converted regional visual features, $V_c$ is the local visual context, $\mathrm{head}_i$ ($i = 1, \dots, h$) is the output of the $i$-th attention head, and $h$ is the total number of heads.
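For illustration, the following sketch computes the attention-guided visual context of formulas (1) to (3) with the text vector $f_t$ acting as a single query over the $7 \times 7$ regional features; it relies on PyTorch's built-in multi-head attention rather than hand-written head projections, which is an implementation choice and not the patent's exact construction.

```python
import torch.nn as nn

class VisualContextNetwork(nn.Module):
    """VCN sketch: the encoded text vector f_t queries the 7x7 regional ResNet
    features through multi-head scaled dot-product attention."""
    def __init__(self, text_dim=512, visual_dim=512, dim=512, num_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(text_dim, dim)      # dimension conversion of Q_s = f_t
        self.kv_proj = nn.Linear(visual_dim, dim)   # dimension conversion of V_r
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, f_t, region_feats):
        # f_t: (batch, text_dim); region_feats: (batch, 49, visual_dim)
        q = self.q_proj(f_t).unsqueeze(1)           # one query "token" per image-text pair
        kv = self.kv_proj(region_feats)
        v_c, attn = self.mha(q, kv, kv)             # softmax(QK^T / sqrt(d_k)) V, multi-head
        return v_c.squeeze(1), attn                 # local visual context V_c, attention map
```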
3) Visual Language Context Network (VLCN)
Visual-language context embeddings are learned on the large multi-modal tweet dataset Twitter100k using a biLSTM network. First, the visual vector $v_G = s_G \cdot V_c$ is given, together with a word sequence $\{w_t\}$, $t = 1, \dots, T$, of length $T$, where $s_G$ is the visual context gating score and $V_c$ is the local visual context. A forward LSTM network predicts the word $w_t$ from $(w_1, \dots, w_{t-1})$, and at time $t = 0$ the forward input is the visual vector $v_G$; meanwhile, a backward LSTM network predicts the word $w_t$ from $(w_{t+1}, \dots, w_T)$, and at time $T + 1$ the backward input is the visual vector $v_G$. The word embedding [BOS] is added to the word sequence to indicate the start and the word embedding [EOS] is added to indicate the end, so the sequence can be expressed as ([BOS], $w_1, \dots, w_T$, [EOS]). The visual feature replaces [BOS] in forward prediction and replaces [EOS] in backward prediction. The concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network, the same as the biLSTM input in the RGN network.
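For illustration, a minimal sketch of the VLCN is given below; it assumes the gated visual vector $v_G = s_G \cdot V_c$ is projected into the word-embedding space and substituted for [BOS] (forward) and [EOS] (backward), with dimensions and the projection layer being illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualLanguageContextNetwork(nn.Module):
    """VLCN sketch: forward and backward LSTMs over the word sequence, with the
    gated visual vector v_G = s_G * V_c taking the place of [BOS] / [EOS]."""
    def __init__(self, emb_dim=400, hidden_dim=512, visual_dim=512):
        super().__init__()
        self.fwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.visual_proj = nn.Linear(visual_dim, emb_dim)  # map v_G into embedding space

    def forward(self, word_embs, s_g, v_c):
        # word_embs: (batch, T, emb_dim); s_g: (batch,); v_c: (batch, visual_dim)
        v_g = self.visual_proj(s_g.unsqueeze(-1) * v_c).unsqueeze(1)
        fwd_in = torch.cat([v_g, word_embs], dim=1)                        # v_G replaces [BOS]
        bwd_in = torch.cat([v_g, torch.flip(word_embs, dims=[1])], dim=1)  # v_G replaces [EOS]
        h_fwd, _ = self.fwd_lstm(fwd_in)            # (batch, T+1, hidden)
        h_bwd, _ = self.bwd_lstm(bwd_in)            # (batch, T+1, hidden)
        # visual-language context embedding: forward and backward states at each word
        h_bwd_aligned = torch.flip(h_bwd[:, 1:], dims=[1])
        context = torch.cat([h_fwd[:, 1:], h_bwd_aligned], dim=-1)
        return context, h_fwd, h_bwd
```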
3. Pre-training tasks
Task 1: text-image relationship classification
Text-image relationship classification is performed using the Bloomberg text-image relation dataset to determine whether the content of the image provides effective information beyond the text. The text-image relationships and their statistics in the Bloomberg text-image relation dataset are shown in Table 1 below.
Table 1. The four text-image relationships in the Bloomberg text-image relation dataset

Relationship   Image adds to the tweet's semantics   Text is represented in the image   Percent (%)
R1             ✓                                     ✓                                  18.5
R2             ✓                                     ×                                  25.6
R3             ×                                     ✓                                  21.9
R4             ×                                     ×                                  33.8

In Table 1 above, R1, R2, R3 and R4 are the codes of the four text-image relationships.
to take advantage of the large number of unlabeled multimodal corpora, a teacher-student semi-supervised paradigm is employed to enhance the text-image relationship classification task. Firstly, training a teacher model on a twitter100k data set; the teacher model is an independent network that is structured identically to a text-image Relational Gated Network (RGN). The teacher model is then used to predict a large unmarked multimodal corpus (twit 100k dataset). Selecting text-image related categories with higher scores (category scores)>0.6) to construct a new pseudo-labeled training data, represented as pseudo-labeled100K data set (pseudo-labeled 100K). And finally, in the training of a pre-training multi-modal network model (RIVA model), a text-image Relational Gating Network (RGN) is taken as a student model, the student model is firstly trained by using data in a' pseudo labeling 100K data set, and fine tuning is carried out on a Pengbo text-image relational coefficient data set so as to reduce noise labeling errors. Let xi=<texti,imageiIs a tweet pair of text images, computes the loss of binary relational classification of data in the penbo text-image relational dataset (Bloomberg) and the pseudo-tagged 100K dataset (pseudo-tagged 100K) by cross entropy:
Figure BDA0002730638870000091
Figure BDA0002730638870000092
in the above formulas (4) to (5),
Figure BDA0002730638870000093
for binary relation classification loss in the penbo text-image relation dataset, Bloomberg for the penbo text-image relation dataset,
Figure BDA0002730638870000101
for binary relation classification loss in the pseudo-label 100K dataset, pseudo-labeled100K is the pseudo-label 100K dataset; p (x)i) Probability of correct classification, calculated by softmax layer;
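For illustration, a minimal sketch of the pseudo-labeling step is given below; the teacher is assumed to be an RGN-style classifier returning (logits, $s_G$) as in the earlier sketches, the data source is assumed to yield batches of (token embeddings, pooled visual vectors, raw text-image pairs), and the 0.6 threshold follows the text.

```python
import torch

def build_pseudo_labels(teacher, twitter100k_batches, threshold=0.6):
    """Run the trained teacher over unlabeled Twitter100k pairs and keep the
    high-confidence "related" ones as pseudo-labeled training data."""
    teacher.eval()
    pseudo = []
    with torch.no_grad():
        for token_embs, visual_vecs, raw_pairs in twitter100k_batches:
            _, s_g = teacher(token_embs, visual_vecs)
            for score, pair in zip(s_g.tolist(), raw_pairs):
                if score > threshold:            # category score > 0.6, as stated above
                    pseudo.append((pair, 1))     # pseudo label: text-image related
    return pseudo

# The student (RGN) is then trained in two stages, per the text:
#   1) train on the pseudo-labeled 100K set with the cross-entropy loss (5),
#   2) fine-tune on the manually labeled Bloomberg set to reduce label noise.
```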
task 2: next word prediction:
visual Language Context Networks (VLCNs) compute the probability of a sequence by modeling the probability of the next word in both the front and back directions. The probability of this sentence is:
Figure BDA0002730638870000102
probability of the above formula p (w)tL.) is calculated by the LSTM network after the full connectivity layer FC and softmax layers,
Figure BDA0002730638870000103
through calculating the cross entropy loss of the predicted words. Thus, the training task is to minimize the target loss in the front-to-back direction:
Figure BDA0002730638870000104
in the above formula, the first and second carbon atoms are,
Figure BDA0002730638870000105
for target loss in the front-rear direction, { wtT, T is the sequence wtThe length of the strip is,
Figure BDA0002730638870000106
is a visual vector;
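For illustration, a minimal sketch of the bidirectional next-word prediction loss of formulas (6) to (7) is given below; it assumes forward/backward hidden states shaped as in the VLCN sketch above (the visual vector occupying step 0) and a shared vocabulary projection layer, and uses a mean rather than summed cross entropy, which is an implementation choice.

```python
import torch
import torch.nn.functional as F

def nwp_loss(h_fwd, h_bwd, word_ids, vocab_proj):
    """Bidirectional next-word prediction loss (sketch).
    h_fwd, h_bwd: (batch, T+1, hidden) hidden states whose step 0 corresponds
    to the visual vector v_G; word_ids: (batch, T); vocab_proj: hidden -> |V|."""
    fwd_logits = vocab_proj(h_fwd[:, :-1])        # step t-1 predicts w_t
    bwd_logits = vocab_proj(h_bwd[:, :-1])        # predicts w_T, ..., w_1 in reverse order
    fwd_loss = F.cross_entropy(fwd_logits.flatten(0, 1), word_ids.flatten())
    bwd_loss = F.cross_entropy(bwd_logits.flatten(0, 1),
                               torch.flip(word_ids, dims=[1]).flatten())
    return fwd_loss + bwd_loss                    # minimize the loss in both directions
```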
4. Using the pre-trained multi-modal network model (RIVA model) for the multi-modal NER task
The pre-trained multi-modal model is tested using the biLSTM-CRF model as the benchmark model for named entity recognition. The biLSTM-CRF model consists of a bidirectional LSTM network and a Conditional Random Field (CRF). The word embedding $e_k$ is input into the biLSTM network, and the CRF uses the biLSTM hidden vector $h_t$ of each word embedding $e_k$ to tag the sequence with entity labels. With the pre-trained multi-modal network model (RIVA model), after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks in the Visual Language Context Network (VLCN) at each embedding are concatenated into a visual-language context embedding $e_k^{VLC}$; when performing the multi-modal NER task, the word embedding $e_k$ is replaced with $e_k^{VLC}$.
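For illustration, a minimal sketch of the benchmark tagger with the word embeddings $e_k$ swapped for the visual-language context embeddings $e_k^{VLC}$ is given below; the CRF layer assumes the third-party pytorch-crf package and stands in for any standard linear-chain CRF implementation.

```python
import torch.nn as nn
from torchcrf import CRF   # third-party pytorch-crf package (assumed available)

class BiLSTMCRFTagger(nn.Module):
    """Benchmark biLSTM-CRF tagger; in the multi-modal setting the input
    embeddings are simply the e_k^{VLC} sequence produced by the RIVA model."""
    def __init__(self, emb_dim, hidden_dim, num_tags):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, T, emb_dim) - word embeddings e_k or e_k^{VLC}
        h, _ = self.bilstm(embeddings)
        scores = self.emissions(h)
        if tags is not None:
            return -self.crf(scores, tags, mask=mask)   # negative log-likelihood
        return self.crf.decode(scores, mask=mask)       # best tag sequence per sentence
```

In the "biLSTM-CRF (text) + RIVA (text + picture)" configuration, `embeddings` is the $e_k^{VLC}$ sequence from the pre-trained RIVA model; in the plain baseline it is the FastText word-embedding sequence described in Example 2.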
Example 2:
1. Parameter settings
100-dimensional GloVe word vectors are used in the pre-trained multi-modal network model (RIVA model), and 300-dimensional FastText Crawl word vectors are used in the biLSTM-CRF model. All images are resized to 224 × 224 to match the input of ResNet. Visual features are extracted using ResNet-34, which is fine-tuned with a learning rate of 1e-4. The FC layer in Fig. 2 is a linear neural network followed by a GELU activation layer.
The model was trained on a computer equipped with an NVIDIA Tesla K80 (GPU) and an Intel Xeon Silver 4114 Processor 2.2 GHz (CPU). On one GPU core, training the pre-trained multi-modal network model (RIVA model) takes about 32 hours, and the best results are reached at about 35 training epochs. Table 2 below shows the values of the hyper-parameters in the pre-trained multi-modal network model (RIVA model) and the biLSTM-CRF model.
Table 2. Hyper-parameter settings of the RIVA and biLSTM-CRF models
(Table 2 is provided as an image in the original publication and is not reproduced here.)
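Since Table 2 itself is only available as an image, the following configuration sketch records just the hyper-parameters stated in the surrounding text; values not mentioned there (batch size, hidden sizes, optimizer) are deliberately omitted rather than guessed.

```python
# Only the settings explicitly stated in the text are recorded here.
RIVA_CONFIG = {
    "word_vectors": "GloVe-100d",        # used inside the RIVA model
    "image_size": (224, 224),            # resize to match the ResNet input
    "visual_backbone": "ResNet-34",
    "backbone_lr": 1e-4,                 # fine-tuning learning rate for ResNet-34
    "best_epoch": 35,                    # best results around 35 training epochs
}
BILSTM_CRF_CONFIG = {
    "word_vectors": "FastText-Crawl-300d",
}
```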
2. Performance testing of image-text relational classification
Table 3 below shows the performance of the text-image Relation Gating Network (RGN) for text-image relationship classification on the Bloomberg text-image relation dataset. In terms of network architecture, Lu et al. represent multi-modal features as the concatenation of linguistic and visual features, whereas the text-image Relation Gating Network (RGN) of the present invention uses element-wise multiplication. The advantage of element-wise multiplication is that the parameter gradients in one modality are more strongly influenced by the data of the other modality, which enables collaborative learning of multi-modal data. The F1 score of the text-image Relation Gating Network (RGN) trained on the Bloomberg text-image relation dataset is improved by 4.7% compared with the method proposed by Lu et al. in 2018. The performance of the text-image Relation Gating Network (RGN) is improved by a further 1.1% after incorporating the data in the pseudo-labeled 100K dataset.
Table 3. F1 scores of the text-image relation network on the Bloomberg dataset
(Table 3 is provided as an image in the original publication and is not reproduced here.)
3. Performance test results of the pre-trained multi-modal network model (RIVA model)
Table 4 below illustrates the performance improvement of the pre-trained multi-modal network model (RIVA model) over the biLSTM-CRF model. "biLSTM-CRF (text)" performs the NER task on the word embedding sequence. "biLSTM-CRF (text + picture)" adds the visual feature at the beginning of the embedding sequence to inform the biLSTM-CRF model of the image content. "biLSTM-CRF (text) + RIVA (text + picture)" means that the text-image pair is input into the pre-trained multi-modal network model (RIVA model); "biLSTM-CRF (text) + RIVA (text)" means that the text is the only input to the pre-trained multi-modal network model (RIVA model), i.e. the text-image Relation Gating Network (RGN) and the attention-guided Visual Context Network (VCN) are removed.
Table 4. F1 scores of biLSTM-CRF after being augmented with the RIVA model
(Table 4 is provided as an image in the original publication and is not reproduced here.)
The results show that "biLSTM-CRF (text) + RIVA (text + picture)" improves over "biLSTM-CRF (text)" by 1.8% and 2.2% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. Regarding the effect of visual features, "biLSTM-CRF (text + picture)" improves the F1 score by 0.35% on average compared with "biLSTM-CRF (text)", and "biLSTM-CRF (text) + RIVA (text + picture)" improves by 1.45% on average compared with "biLSTM-CRF (text) + RIVA (text)". This indicates that the RIVA model can better exploit visual features to enhance the context information of the tweet. In Table 5 below, "-" indicates no test result; the three visual language models, i.e. the pre-trained multi-modal network model (RIVA model), the ACN model and the VAM model, are compared when connected to the biLSTM-CRF model. The results show that the performance is best when the pre-trained multi-modal network model (RIVA model) is connected.
Table 5. Comparison of F1 scores after connecting RIVA, ACN and VAM to biLSTM-CRF

Visual language model   Fudan University multi-modal tweet dataset   Snap Research MNER Twitter dataset
biLSTM-CRF + ACN        70.7                                         -
biLSTM-CRF + VAM        -                                            80.7
biLSTM-CRF + RIVA       71.5                                         82.3
4. Investigation of RGN ablation
To test the role of the RGN, the text-image Relation Gating Network (RGN) is removed from the pre-trained multi-modal network model (RIVA model) and the output of the attention-guided Visual Context Network (VCN) is passed directly to the input of the biLSTM network of the Visual Language Context Network (VLCN), i.e. the text-image correlation score $s_G$ is fixed to 1.
Table 6 below shows that removing the RGN reduces the overall F1 score of the RIVA model by 0.7% and 0.9% on the Fudan University multi-modal tweet dataset and the Snap Research MNER Twitter dataset, respectively. In addition, the test data are divided into two groups, "image adds meaning" and "image does not add meaning", to compare the effect of removing the RGN on data of different text-image relation types. It is found that the performance on the "image adds meaning" data hardly changes, while the performance on the "image does not add meaning" data decreases; specifically, the F1 score on the Fudan University multi-modal tweet dataset drops by 1.2% and the F1 score on the Snap Research MNER Twitter dataset drops by 1.5%. This demonstrates that text-irrelevant visual features can negatively impact the learning of visual-language representations.
Table 6. F1 scores of the RIVA network before and after RGN removal
(Table 6 is provided as an image in the original publication and is not reproduced here.)
Experimental conclusions:
the invention relates to a visual feature attention problem when multimodality learning is carried out on a tweet and images are unrelated to texts. The invention provides a multi-modal tweet named entity recognition method based on text-picture relation pre-training, and finds out and relieves the problem that visual features irrelevant to text can generate negative return in multi-modal NER. Experiments show that after a pre-trained multi-modal network model (RIVA model) uses a teacher-student semi-supervised training method under a multi-task framework of text-image relation classification and next word prediction, a text-image relation classification task is finished excellently. The result shows that the performance of the RIVA model is superior to that of visual language models such as ACN and VAM when multi-modal information is fused.

Claims (9)

1. A multi-modal tweet named entity recognition method based on text-picture relation pre-training, characterized by comprising the following steps:
step 1, large-scale data collection: using the twitter100k dataset as an unlabeled multimodal corpus; merging the image-text relations in the Bloomberg text-image relation dataset into a text-image related relation and a text-image unrelated relation, and dividing the Bloomberg text-image relation dataset into a training set and a test set according to a fixed proportion; selecting the multi-modal tweet dataset of Fudan University and the MNER Twitter dataset of Snap Research as the data base;
step 2, establishing a pre-trained multi-modal network model for relationship inference and visual attention, wherein the pre-trained multi-modal network model for relationship inference and visual attention comprises: a text-image relation gating network, an attention-guided visual context network, and a visual language context network;
step 3, pre-training tasks;
and step 4, applying the pre-trained multi-modal network model to the multi-modal NER task: testing the pre-trained multi-modal model using the biLSTM-CRF model as the reference model for named entity recognition; the word embedding $e_k$ is input into the biLSTM network, and the conditional random field uses the biLSTM hidden vector $h_t$ of each word embedding $e_k$ to tag the sequence with entity labels; using the pre-trained multi-modal network model, after a text-image pair is input, the hidden outputs of the forward and backward LSTM networks in the visual language context network at each embedding are concatenated into a visual-language context embedding $e_k^{VLC}$; when performing the multi-modal NER task, the word embedding $e_k$ is replaced with $e_k^{VLC}$.
2. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 1, wherein step 2 specifically comprises the following steps:
step 2.1, establishing the text-image relation gating network: text-image relation classification is completed by a fully connected layer based on the fusion of language and visual features; the language features of the tweet are learned by the biLSTM network;
step 2.1.1, the word embedding and the character embedding of each word are concatenated and input into the biLSTM network, and the forward output and backward output of the biLSTM network are concatenated as the encoded text vector $f_t \in \mathbb{R}^{1 \times d_t}$, wherein $d_t$ is the dimension of the text vector $f_t$ and $1 \times d_t$ is the size of the vector space to which $f_t$ belongs;
step 2.1.2, the visual feature $f_v$ is extracted from the image using ResNet; based on the output size of the last convolutional layer in ResNet, average pooling is applied over a fixed region and the whole image is represented as a fixed-dimension vector $f_v$;
step 2.1.3, finally, the encoded text vector and the encoded image vector are multiplied element-wise, $f_t \odot f_v$, and the result is input into the FC layer and the softmax layer to obtain the binary classification score and the visual context gate $s_G$;
step 2.2, establishing the attention-guided visual context network;
step 2.2.1, let $V_r = \{r_{ij}\}$ be the regional visual features of a given image, wherein $i = 1, \dots, m$, $j = 1, \dots, n$, and $r_{ij} \in \mathbb{R}^{d_v}$; $r_{ij}$ is a regional feature, $d_v$ is its dimension, $m \times n \times d_v$ is the output size of the last convolutional layer in ResNet, and $m \times n$ is the number of regions in the image;
step 2.2.2, local visual features related to the language context are captured using scaled dot-product attention, which is defined as:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)$
in the above formula, the matrices $Q$, $K$ and $V$ represent the query, key and value, respectively, and $d_k$ is the dimension of the key;
step 2.2.3, the language query vector $Q_s = f_t$ is used as the query and the regional visual features $V_r$ are used as the keys and values; the language query vector $Q_s$ and the regional visual features $V_r$ are converted to the same dimension by linear projection, giving $\tilde{Q}_s$ and $\tilde{V}_r$;
step 2.2.4, the language attention $A_s = \mathrm{Attention}(\tilde{Q}_s, \tilde{V}_r, \tilde{V}_r)$ is calculated, wherein $\tilde{Q}_s$ is the dimension-converted language query vector $Q_s$ and $\tilde{V}_r$ is the dimension-converted regional visual features $V_r$; single-head attention is extended to multi-head attention, and the output of the local visual context $V_c$ is defined as:
$V_c = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O} \quad (2)$
$\mathrm{head}_i = \mathrm{Attention}(\tilde{Q}_s W_i^{Q}, \tilde{V}_r W_i^{K}, \tilde{V}_r W_i^{V}) \quad (3)$
in formulas (2) to (3), $Q_s$ is the language query vector, $\tilde{Q}_s$ is the dimension-converted language query vector, $\tilde{V}_r$ is the dimension-converted regional visual features, $V_c$ is the local visual context, $\mathrm{head}_i$ ($i = 1, \dots, h$) is the output of the $i$-th attention head, and $h$ is the total number of heads;
step 2.3, establishing the visual language context network, and learning visual-language context embeddings on the twitter100k dataset using the biLSTM network;
step 2.3.1, first, the visual vector $v_G = s_G \cdot V_c$ is given, together with a word sequence $\{w_t\}$, $t = 1, \dots, T$, of length $T$, wherein $s_G$ is the visual context gating score, $V_c$ is the local visual context, and $T$ is the length of the sequence $\{w_t\}$;
step 2.3.2, a forward LSTM network predicts the word $w_t$ from $(w_1, \dots, w_{t-1})$, and at time $t = 0$ the forward input is the visual vector $v_G$; meanwhile, a backward LSTM network predicts the word $w_t$ from $(w_{t+1}, \dots, w_T)$, and at time $T + 1$ the backward input is the visual vector $v_G$;
step 2.3.3, the word embedding [BOS] is added to the word sequence to indicate the start and the word embedding [EOS] is added to indicate the end, so the sequence is expressed as ([BOS], $w_1, \dots, w_T$, [EOS]); the visual feature replaces [BOS] in forward prediction and replaces [EOS] in backward prediction; the concatenation of the word embedding and the word's character embedding is used as the input of the LSTM network.
3. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 1, wherein step 3 specifically comprises the following steps:
step 3.1, text-image relation classification;
step 3.1.1, text-image relation classification is performed using the Bloomberg text-image relation dataset to determine whether the content of the image provides effective information beyond the text;
step 3.1.2, the text-image relation classification task is enhanced using a teacher-student semi-supervised paradigm: first, a teacher model is trained on the Bloomberg text-image relation dataset; then, the teacher model is used to predict the twitter100k dataset, tweets with higher scores in the text-image related category are selected, and new pseudo-labeled training data are constructed, the pseudo-labeled training data forming the pseudo-labeled 100K dataset; finally, in the training of the pre-trained multi-modal network model, the text-image relation gating network is taken as the student model, which is first trained with the pseudo-labeled training data and then fine-tuned on the Bloomberg text-image relation dataset to reduce noisy labeling errors;
step 3.1.3, let $x_i = \langle \text{text}_i, \text{image}_i \rangle$ be a text-image tweet pair; the binary relation classification losses on the Bloomberg text-image relation dataset and the pseudo-labeled 100K dataset are computed by cross entropy:
$\mathcal{L}_{\mathrm{Bloomberg}} = -\sum_{x_i \in \mathrm{Bloomberg}} \log p(x_i) \quad (4)$
$\mathcal{L}_{\mathrm{pseudo}} = -\sum_{x_i \in \mathrm{pseudo\text{-}labeled\,100K}} \log p(x_i) \quad (5)$
in formulas (4) to (5), $\mathcal{L}_{\mathrm{Bloomberg}}$ is the binary relation classification loss on the Bloomberg text-image relation dataset, Bloomberg denotes the Bloomberg text-image relation dataset, $\mathcal{L}_{\mathrm{pseudo}}$ is the binary relation classification loss on the pseudo-labeled 100K dataset, pseudo-labeled 100K denotes the pseudo-labeled 100K dataset, $x_i$ is an image-text pair, and $p(x_i)$ is the probability of correct classification, calculated by the softmax layer;
step 3.2, next word prediction: the visual language context network computes the probability of a sequence by modeling the probability of the next word in the forward and backward directions; the probability of the sentence is:
$p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid v_G, w_1, \dots, w_{t-1}) = \prod_{t=1}^{T} p(w_t \mid w_{t+1}, \dots, w_T, v_G) \quad (6)$
in the above formula, the probability $p(w_t \mid \cdot)$ is calculated by the LSTM network followed by the FC layer and the softmax layer, and the cross-entropy loss of the predicted word is computed from it; the target loss in the forward and backward directions is minimized:
$\mathcal{L}_{\mathrm{NWP}} = -\sum_{t=1}^{T} \left[ \log p(w_t \mid v_G, w_1, \dots, w_{t-1}) + \log p(w_t \mid w_{t+1}, \dots, w_T, v_G) \right] \quad (7)$
in the above formula, $\mathcal{L}_{\mathrm{NWP}}$ is the target loss in the forward and backward directions, $\{w_t\}$, $t = 1, \dots, T$, is the word sequence, $T$ is its length, and $v_G = s_G \cdot V_c$ is the visual vector.
4. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 1, wherein in step 1, the Bloomberg text-image relation dataset is divided into a training set and a test set at a ratio of 8:2.
5. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 1, wherein in step 2, the text-image relation gating network performs classification based on the text-image relation and outputs a correlation score between the text and the image, the correlation score being used as a gating control on the path from the attention-guided visual context network to the visual language context network; the attention-guided visual context network is a network based on visual-text attention for extracting local visual information related to the text, and its output is a visual context that is used as an input of the LSTM network to guide the learning of the visual language context network; the visual language context network is a visual language model for performing the next word prediction task.
6. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 2, wherein the fully connected layer based on the fusion of language and visual features in step 2.1 adopts element-wise multiplication for the fusion of the language and visual features.
7. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 2, wherein the FC layer in step 2.1.3 is a linear neural network.
8. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 1, wherein the biLSTM-CRF model in step 4 consists of a bidirectional LSTM network and a conditional random field.
9. The multi-modal tweet named entity recognition method based on text-picture relation pre-training according to claim 3, wherein the teacher model in step 3.1.2 is an independent network whose structure is the same as that of the text-image relation gating network.
CN202011116968.2A 2020-10-19 2020-10-19 Multi-mode push text named entity recognition method based on text-picture relation pre-training Active CN112257445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011116968.2A CN112257445B (en) 2020-10-19 2020-10-19 Multi-mode push text named entity recognition method based on text-picture relation pre-training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011116968.2A CN112257445B (en) 2020-10-19 2020-10-19 Multi-mode push text named entity recognition method based on text-picture relation pre-training

Publications (2)

Publication Number Publication Date
CN112257445A true CN112257445A (en) 2021-01-22
CN112257445B CN112257445B (en) 2024-01-26

Family

ID=74244224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011116968.2A Active CN112257445B (en) 2020-10-19 2020-10-19 Multi-mode push text named entity recognition method based on text-picture relation pre-training

Country Status (1)

Country Link
CN (1) CN112257445B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112800259A (en) * 2021-04-07 2021-05-14 武汉市真意境文化科技有限公司 Image generation method and system based on edge closure and commonality detection
CN113158875A (en) * 2021-04-16 2021-07-23 重庆邮电大学 Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN113627172A (en) * 2021-07-26 2021-11-09 重庆邮电大学 Entity identification method and system based on multi-granularity feature fusion and uncertain denoising
CN113657115A (en) * 2021-07-21 2021-11-16 内蒙古工业大学 Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN113704502A (en) * 2021-08-27 2021-11-26 电子科技大学 Multi-mode information fusion account position identification method in social media
CN113704547A (en) * 2021-08-26 2021-11-26 合肥工业大学 Multi-mode label recommendation method based on one-way supervision attention
CN113806564A (en) * 2021-09-22 2021-12-17 齐鲁工业大学 Multi-mode informativeness tweet detection method and system
CN114549850A (en) * 2022-01-24 2022-05-27 西北大学 Multi-modal image aesthetic quality evaluation method for solving modal loss problem
CN114782739A (en) * 2022-03-31 2022-07-22 电子科技大学 Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer
CN115080766A (en) * 2022-08-16 2022-09-20 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
JP2022141587A (en) * 2021-03-15 2022-09-29 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method and apparatus for acquiring pretraining model
CN116341555A (en) * 2023-05-26 2023-06-27 华东交通大学 Named entity recognition method and system
CN116842141A (en) * 2023-08-28 2023-10-03 北京中安科技发展有限公司 Alarm smoke linkage based digital information studying and judging method
CN116561326B (en) * 2023-07-10 2023-10-13 中国传媒大学 Image text event extraction method, system and equipment based on label enhancement

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
US20180365562A1 (en) * 2017-06-20 2018-12-20 Battelle Memorial Institute Prediction of social media postings as trusted news or as types of suspicious news
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760507A (en) * 2016-02-23 2016-07-13 复旦大学 Cross-modal subject correlation modeling method based on deep learning
US20180365562A1 (en) * 2017-06-20 2018-12-20 Battelle Memorial Institute Prediction of social media postings as trusted news or as types of suspicious news
CN108829677A (en) * 2018-06-05 2018-11-16 大连理工大学 A kind of image header automatic generation method based on multi-modal attention
CN110111399A (en) * 2019-04-24 2019-08-09 上海理工大学 A kind of image text generation method of view-based access control model attention
CN111046668A (en) * 2019-12-04 2020-04-21 北京信息科技大学 Method and device for recognizing named entities of multi-modal cultural relic data
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIE TENG; YANG JUN'AN; LIU HUI: "Chinese Entity Recognition Based on the BERT-BiLSTM-CRF Model" (基于BERT-BiLSTM-CRF模型的中文实体识别), Computer Systems & Applications, No. 07 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022141587A (en) * 2021-03-15 2022-09-29 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Method and apparatus for acquiring pretraining model
CN112800259A (en) * 2021-04-07 2021-05-14 武汉市真意境文化科技有限公司 Image generation method and system based on edge closure and commonality detection
CN113158875B (en) * 2021-04-16 2022-07-01 重庆邮电大学 Image-text emotion analysis method and system based on multi-mode interaction fusion network
CN113158875A (en) * 2021-04-16 2021-07-23 重庆邮电大学 Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN113657115A (en) * 2021-07-21 2021-11-16 内蒙古工业大学 Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN113657115B (en) * 2021-07-21 2023-06-30 内蒙古工业大学 Multi-mode Mongolian emotion analysis method based on ironic recognition and fine granularity feature fusion
CN113627172A (en) * 2021-07-26 2021-11-09 重庆邮电大学 Entity identification method and system based on multi-granularity feature fusion and uncertain denoising
CN113704547A (en) * 2021-08-26 2021-11-26 合肥工业大学 Multi-mode label recommendation method based on one-way supervision attention
CN113704547B (en) * 2021-08-26 2024-02-13 合肥工业大学 Multimode tag recommendation method based on unidirectional supervision attention
CN113704502B (en) * 2021-08-27 2023-04-21 电子科技大学 Multi-mode information fusion account number position identification method based on social media
CN113704502A (en) * 2021-08-27 2021-11-26 电子科技大学 Multi-mode information fusion account position identification method in social media
CN113806564B (en) * 2021-09-22 2024-05-10 齐鲁工业大学 Multi-mode informative text detection method and system
CN113806564A (en) * 2021-09-22 2021-12-17 齐鲁工业大学 Multi-mode informativeness tweet detection method and system
CN114549850B (en) * 2022-01-24 2023-08-08 西北大学 Multi-mode image aesthetic quality evaluation method for solving modal missing problem
CN114549850A (en) * 2022-01-24 2022-05-27 西北大学 Multi-modal image aesthetic quality evaluation method for solving modal loss problem
CN114782739A (en) * 2022-03-31 2022-07-22 电子科技大学 Multi-modal classification model based on bidirectional long and short term memory layer and full connection layer
CN115080766B (en) * 2022-08-16 2022-12-06 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN115080766A (en) * 2022-08-16 2022-09-20 之江实验室 Multi-modal knowledge graph characterization system and method based on pre-training model
CN116341555A (en) * 2023-05-26 2023-06-27 华东交通大学 Named entity recognition method and system
CN116341555B (en) * 2023-05-26 2023-08-04 华东交通大学 Named entity recognition method and system
CN116561326B (en) * 2023-07-10 2023-10-13 中国传媒大学 Image text event extraction method, system and equipment based on label enhancement
CN116842141A (en) * 2023-08-28 2023-10-03 北京中安科技发展有限公司 Alarm smoke linkage based digital information studying and judging method
CN116842141B (en) * 2023-08-28 2023-11-07 北京中安科技发展有限公司 Alarm smoke linkage based digital information studying and judging method

Also Published As

Publication number Publication date
CN112257445B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN112257445A (en) Multi-modal tweet named entity recognition method based on text-picture relation pre-training
Nooralahzadeh et al. Progressive transformer-based generation of radiology reports
CN104615608B (en) A kind of data mining processing system and method
CN109635280A (en) A kind of event extraction method based on mark
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN111444367B (en) Image title generation method based on global and local attention mechanism
CN109255027B (en) E-commerce comment sentiment analysis noise reduction method and device
Thirumoorthy et al. Feature selection using hybrid poor and rich optimization algorithm for text classification
CN109189862A (en) A kind of construction of knowledge base method towards scientific and technological information analysis
CN110597979A (en) Self-attention-based generating text summarization method
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN108256968A (en) A kind of electric business platform commodity comment of experts generation method
CN111353306A (en) Entity relationship and dependency Tree-LSTM-based combined event extraction method
CN110516098A (en) Image labeling method based on convolutional neural networks and binary coding feature
CN112989806A (en) Intelligent text error correction model training method
CN110909116A (en) Entity set expansion method and system for social media
CN116127056A (en) Medical dialogue abstracting method with multi-level characteristic enhancement
Liu et al. UAMNer: uncertainty-aware multimodal named entity recognition in social media posts
CN113407697A (en) Chinese medical question classification system for deep encyclopedia learning
He et al. Syntax-aware entity representations for neural relation extraction
CN115422939A (en) Fine-grained commodity named entity identification method based on big data
Peng et al. Relation-aggregated cross-graph correlation learning for fine-grained image–text retrieval
Zhao et al. Aligned visual semantic scene graph for image captioning
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
Bergamaschi et al. Conditional random fields with semantic enhancement for named-entity recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant