CN110020438A - Enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition - Google Patents
Enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition
- Publication number
- CN110020438A CN110020438A CN201910297022.1A CN201910297022A CN110020438A CN 110020438 A CN110020438 A CN 110020438A CN 201910297022 A CN201910297022 A CN 201910297022A CN 110020438 A CN110020438 A CN 110020438A
- Authority
- CN
- China
- Prior art keywords
- data
- word
- synonymous
- training
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides an enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition. The method includes: crawling a public news data set and performing data cleaning to obtain cleaned data; extracting the entity words from the cleaned data to obtain preliminary normalized data; setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized; determining the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized; setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data to obtain training data; pre-training character vectors and word vectors, and concatenating the character vectors and word vectors vertically to obtain new vectors; building a model with an Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model; and predicting samples to be predicted with the best-metric model.
Description
Technical field
The present invention relates to the technical field of entity disambiguation, and in particular to an enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition.
Background art
Entity disambiguation is, conceptually, a way of avoiding the semantic confusion that arises when the same noun carries different meanings. In recent years, with the development of artificial intelligence technology, the market demand for accurately identifying Chinese synonyms in long texts has become increasingly apparent, and this demand is especially urgent in the legal and financial industries. With the development of natural language processing technology, entity disambiguation methods for the Chinese domain have also multiplied; methods currently on the market include those based on text classification and those combining knowledge bases with deep learning. These techniques share one drawback: they convert the entity disambiguation problem into a text classification problem, which raises the following issues: 1. models in the machine learning field cannot extract textual context features well; 2. the text-classification approach requires judging the ambiguity of each entity word and relies on a large, complex knowledge base as support. This makes the technology needed for a project complex to build, and such solutions lack good applicability in terms of both cost control and performance.
In recent years, meanwhile, sequence models with the Encoder-Decoder structure have been on the rise for language-model-style text processing, which has also brought new ideas to Chinese-domain entity disambiguation methods. This model structure treats text as a sequence: the input is a pre-built short text containing multiple entity words with the same name, and the output is the alphanumeric tag corresponding to each character in the short text. In this way, character features can be extracted so that the model is trained with good coverage of textual context. At the same time, with the position embedding based on text segmentation proposed by Google as the positional feature of a phrase, embedding it into neural network training and processing it with multiple layers of attention, as in the Transformer model structure, may also bring better results to entity disambiguation techniques based on the sequence-recognition idea.
Summary of the invention
The present invention is intended to provide an enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition that overcomes one of the above problems or at least partially solves any of them.
In order to achieve the above objectives, the technical solution of the present invention is specifically achieved as follows:
One aspect of the present invention provides an enterprise or organization Chinese named entity disambiguation method based on sequence recognition, comprising: crawling a public news data set and performing data cleaning on it to obtain cleaned data; extracting the entity words from the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of: company names (COM) and organization names (ORG); setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized; determining the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized; setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data augmentation to obtain training data; pre-training character vectors and word vectors on the preliminary normalized data, and concatenating the character vectors and word vectors vertically to obtain new vectors; preprocessing the training data; building a model with the Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model; and predicting samples to be predicted with the best-metric model, using a beam search strategy to choose the highest-probability arrangement as the output sequence, thereby obtaining the synonym sequence of the sample to be predicted.
Wherein, crawling the public news data set and performing data cleaning on it to obtain the cleaned data includes: crawling public national, economic, and technology news data, removing special characters and meaningless symbols, checking for null values, and removing any item containing a null value, thereby obtaining the cleaned data. Extracting the entity words from the cleaned data to obtain the preliminary normalized data includes: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and at the same time splitting long and short sentences so that each sentence is kept within a preset character count. Setting the semantic template rules and screening the preliminary normalized data to obtain the data to be normalized includes: performing data screening with the pattern rules "&lt;ORG&gt;+verb+&lt;ORG&gt;", "&lt;COM&gt;+verb+&lt;ORG&gt;", "&lt;ORG&gt;+verb+&lt;COM&gt;", and "&lt;COM&gt;+verb+&lt;COM&gt;" together with a manually constructed dictionary of high-frequency verbs in the field, thereby obtaining the data to be normalized. And/or, determining the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs, includes: taking the entity word with more characters in a sentence as the synonymous standard word; if another entity word in the sentence belongs to the same class as the synonymous standard word, every character of that entity word is contained in the synonymous standard word, and that entity word has more than one character, then taking that entity word as the synonymous secondary word and determining that the synonymous standard word and the synonymous secondary word in the sentence form a synonym pair.
Wherein, setting the data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data augmentation to obtain the training data includes: labeling the first character of the synonymous standard word in each sentence of the data to be normalized as E1, labeling the other characters of the synonymous standard word as I1, labeling the first character of the synonymous secondary word as E2, labeling the other characters of the synonymous secondary word as I2, and labeling the other characters in the sentence that are not in a synonym pair as O; and adding manually supplemented data satisfying the semantic template rules to the preliminary normalized data to obtain the training data, wherein the manually supplemented data are mixed with the data to be normalized by random shuffling. Pre-training character vectors and word vectors on the preliminary normalized data and concatenating the character vectors and word vectors vertically to obtain new vectors includes: training character vectors with a Word2vec model of the Skip-gram structure, adding the entity words to the dictionary formed after word segmentation, training word vectors, and concatenating the word vectors with the character vectors vertically to obtain the new vectors. And/or, preprocessing the training data includes: separating the training data into label sequences and Chinese sequences, filtering stop words from the Chinese sequences, building a dictionary, and encoding the text sequences according to the dictionary index.
Wherein, building a model with the Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model includes: using a model of the Encoder-Decoder structure in which, in the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 extract sequence features, each is serialized by a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, which are output as the intermediate state values of the Encoder; in the decoder, a 2-layer bidirectional recurrent neural network constitutes the decoder, the target sequence of the previous time step is fed into the decoder, and it acts on the intermediate state layer to generate the target sequence of the next time step.
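The encoder idea described above (parallel convolutions with kernel sizes 3, 4, and 5, whose features are pooled into intermediate state values by softmax attention weights) can be illustrated with a minimal sketch. This is not the patent's implementation: it uses a single feature map per kernel size, random weights, and omits the bidirectional recurrent layers and the decoder entirely; it only shows the multi-kernel convolution plus attention pooling pattern.

```python
import math
import random

random.seed(0)
EMB = 8  # toy embedding size; the patent's merged vectors are much larger

def conv1d(seq, kernel):
    """Valid 1-D convolution over a sequence of EMB-dim vectors.

    `kernel` is a list of k weight vectors; each output position is the
    tanh of the summed dot products over a window of k inputs (one scalar
    feature per window, for readability).
    """
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        s = sum(sum(w * x for w, x in zip(kernel[j], seq[i + j]))
                for j in range(k))
        out.append(math.tanh(s))
    return out

def self_attention_pool(features):
    """Softmax attention weights over the features; weighted sum = state."""
    exps = [math.exp(f) for f in features]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * f for w, f in zip(weights, features))

# Toy input sequence of 6 embedded characters.
seq = [[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(6)]
states = []
for k in (3, 4, 5):  # the three kernel sizes of the patent's encoder
    kernel = [[random.uniform(-1, 1) for _ in range(EMB)] for _ in range(k)]
    states.append(self_attention_pool(conv1d(seq, kernel)))
print(len(states))
```

One attention-pooled intermediate state per kernel size is produced; in the real model these would be vector-valued and passed to the decoder.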
Wherein, predicting the samples to be predicted with the best-metric model includes: using a beam search strategy with the beam size set to 3, choosing the highest-probability arrangement as the output sequence, and obtaining the synonym sequence of the sample to be predicted.
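The beam search decoding with beam size 3 can be sketched as follows: at each step the three highest log-probability partial sequences are kept, and the best surviving sequence is the "highest-probability arrangement". Treating the per-step tag distributions as given (and independent of earlier outputs) is a simplification of the patent's decoder, which conditions each step on the previous target sequence.

```python
import math

def beam_search(step_probs, beam_size=3):
    """Beam search over per-step tag distributions (beam size 3 per patent).

    `step_probs` is a list of dicts mapping tag -> probability at each
    time step. Returns the highest-probability tag sequence.
    """
    beams = [((), 0.0)]  # (partial sequence, accumulated log probability)
    for probs in step_probs:
        candidates = []
        for seq, lp in beams:
            for tag, p in probs.items():
                candidates.append((seq + (tag,), lp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # keep only the top beam_size
    return list(beams[0][0])

steps = [{"E1": 0.7, "O": 0.3}, {"I1": 0.6, "O": 0.4}, {"O": 0.9, "I1": 0.1}]
print(beam_search(steps))
```

With beam size 1 this reduces to greedy decoding; a larger beam trades compute for a better chance of finding the globally best arrangement.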
Another aspect of the present invention provides an enterprise or organization Chinese named entity disambiguation device based on sequence recognition, comprising: a data set construction module for crawling a public news data set and performing data cleaning on it to obtain cleaned data; extracting the entity words from the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of: company names (COM) and organization names (ORG); setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized; and determining the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized; a data labeling module for setting a data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data augmentation to obtain training data; a vector training module for pre-training character vectors and word vectors on the preliminary normalized data and concatenating the character vectors and word vectors vertically to obtain new vectors; a preprocessing module for preprocessing the training data; a model training module for building a model with the Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model; and a prediction module for predicting samples to be predicted with the best-metric model, using a beam search strategy to choose the highest-probability arrangement as the output sequence and obtain the synonym sequence of the sample to be predicted.
Wherein, the data set construction module crawls the public news data set and performs data cleaning on it in the following way: the data set construction module is specifically configured to crawl public national, economic, and technology news data, remove special characters and meaningless symbols, check for null values, and remove any item containing a null value, thereby obtaining the cleaned data. The data set construction module extracts the entity words from the cleaned data in the following way: the data set construction module is specifically configured to process the cleaned data with a pre-trained Chinese named entity recognition model, extract the company-name and organization-name entity words in each sentence as supplementary training corpus, and at the same time split long and short sentences so that each sentence is kept within a preset character count. The data set construction module sets the semantic template rules and screens the preliminary normalized data in the following way: the data set construction module is specifically configured to perform data screening with the pattern rules "&lt;ORG&gt;+verb+&lt;ORG&gt;", "&lt;COM&gt;+verb+&lt;ORG&gt;", "&lt;ORG&gt;+verb+&lt;COM&gt;", and "&lt;COM&gt;+verb+&lt;COM&gt;" together with a manually constructed dictionary of high-frequency verbs in the field, thereby obtaining the data to be normalized. And/or, the data set construction module determines the synonymous standard words and synonymous secondary words in the data to be normalized as follows: the data set construction module is specifically configured to take the entity word with more characters in a sentence as the synonymous standard word; if another entity word in the sentence belongs to the same class as the synonymous standard word, every character of that entity word is contained in the synonymous standard word, and that entity word has more than one character, the module takes that entity word as the synonymous secondary word and determines that the synonymous standard word and the synonymous secondary word in the sentence form a synonym pair.
Wherein, the data labeling module sets the data labeling strategy, labels the data to be normalized, and adds manually constructed data for data augmentation in the following way: the data labeling module is specifically configured to label the first character of the synonymous standard word in each sentence of the data to be normalized as E1, label the other characters of the synonymous standard word as I1, label the first character of the synonymous secondary word as E2, label the other characters of the synonymous secondary word as I2, and label the other characters in the sentence that are not in a synonym pair as O; and to add manually supplemented data satisfying the semantic template rules to the preliminary normalized data to obtain the training data, wherein the manually supplemented data are mixed with the data to be normalized by random shuffling. The vector training module pre-trains character vectors and word vectors on the preliminary normalized data and concatenates them vertically to obtain new vectors in the following way: the vector training module is specifically configured to train character vectors with a Word2vec model of the Skip-gram structure, add the entity words to the dictionary formed after word segmentation, train word vectors, and concatenate the word vectors with the character vectors vertically to obtain the new vectors. And/or, the preprocessing module preprocesses the training data in the following way: the preprocessing module is specifically configured to separate the training data into label sequences and Chinese sequences, filter stop words from the Chinese sequences, build a dictionary, and encode the text sequences according to the dictionary index.
Wherein, the model training module builds a model with the Encoder-Decoder structure, trains it on the preprocessed training data, and saves the best-metric model in the following way: the model training module is specifically configured to use a model of the Encoder-Decoder structure in which, in the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 extract sequence features, each is serialized by a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, which are output as the intermediate state values of the Encoder; in the decoder, a 2-layer bidirectional recurrent neural network constitutes the decoder, the target sequence of the previous time step is fed into the decoder, and it acts on the intermediate state layer to generate the target sequence of the next time step.
Wherein, the prediction module predicts the samples to be predicted with the best-metric model in the following way: the prediction module is specifically configured to use a beam search strategy with the beam size set to 3, choose the highest-probability arrangement as the output sequence, and obtain the synonym sequence of the sample to be predicted.
It can be seen that the enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition provided by the embodiments of the present invention crawl a news corpus and build an effective data set through simple semantic templates and manual data labeling, which serves as the basis for model training; character and word vectors are pre-trained on a self-crawled corpus of 5 million news sentences, with the character vector dimension controlled at 500; and, starting from a language model, a model structure is built that extracts the contextual information of text sequences well.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of the enterprise or organization Chinese named entity disambiguation method based on sequence recognition provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the data set construction process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data labeling strategy provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the Encoder-side structure of the model provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the Decoder-side structure of the model provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the enterprise or organization Chinese named entity disambiguation device based on sequence recognition provided by an embodiment of the present invention.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope will be fully conveyed to those skilled in the art.
The purposes of the enterprise or organization Chinese named entity disambiguation method and device based on sequence recognition provided by the embodiments of the present invention are as follows:
First, to solve, in a small and micro enterprise reference public opinion system, the problem of discriminating whether synonymous entity words appear when a user inputs a sentence to search for an enterprise or organization name. If they appear, the system treats all such names as the same target, thereby avoiding the slow retrieval and poor user experience caused by entity ambiguity.
Second, following the idea of converting the entity disambiguation problem into a sequence labeling problem, the present invention dispenses with a complex knowledge base and proposes a new Chinese entity word disambiguation method using the seq2seq model structure.
Fig. 1 shows the flowchart of the enterprise or organization Chinese named entity disambiguation method based on sequence recognition provided by an embodiment of the present invention. Referring to Fig. 1, the method comprises:
S1: crawl a public news data set and perform data cleaning on it to obtain cleaned data.
Specifically, public news data is crawled first, and data cleaning is performed on the crawled public news data set (the detailed process is shown in Fig. 2). As an optional embodiment of the present invention, crawling the public news data set and performing data cleaning on it includes: crawling public national, economic, and technology news data, removing special characters and meaningless symbols, checking for null values, and removing any item containing a null value, thereby obtaining the cleaned data.
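The cleaning step above (special-character removal plus null-value filtering) can be sketched as follows. The character whitelist and function names are illustrative assumptions, not from the patent, which does not specify its exact filters.

```python
import re

def clean_news_items(items):
    """Remove special characters / meaningless symbols and drop null items.

    `items` is a list of raw news strings; None or blank entries are the
    "null values" the patent removes. The whitelist below is an assumption:
    it keeps CJK characters, ASCII letters/digits, and common Chinese
    punctuation, stripping everything else.
    """
    strip_other = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。、；：？！]")
    cleaned = []
    for item in items:
        if item is None or not item.strip():  # null value: drop the item
            continue
        cleaned.append(strip_other.sub("", item))
    return cleaned

print(clean_news_items(["中国建设银行★简称建设银行!!", None, "  "]))
```

In practice the whitelist would be tuned per corpus; the point is that null items are dropped entirely while non-null items are only stripped.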
S2: extract the entity words from the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of: company names (COM) and organization names (ORG).
As an optional embodiment of the present invention, extracting the entity words from the cleaned data to obtain the preliminary normalized data includes: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and at the same time splitting long and short sentences so that each sentence is kept within a preset character count.
In specific implementation, the crawled public news data set can be processed by a pre-built Chinese named entity recognition model (NER system); the entity words of company names (COM) and organization names (ORG) in the texts of the news data set are extracted as supplementary corpus for word vector training, and long and short sentences are split so that each sentence is kept within 500 characters.
S3: set semantic template rules and screen the preliminary normalized data to obtain the data to be normalized.
Specifically, as an optional embodiment of the present invention, setting the semantic template rules and screening the preliminary normalized data to obtain the data to be normalized includes: performing data screening with the pattern rules "&lt;ORG&gt;+verb+&lt;ORG&gt;", "&lt;COM&gt;+verb+&lt;ORG&gt;", "&lt;ORG&gt;+verb+&lt;COM&gt;", and "&lt;COM&gt;+verb+&lt;COM&gt;" together with a manually constructed dictionary of high-frequency verbs in the field, thereby obtaining the data to be normalized.
In specific implementation, sentences containing these two types of entity words are extracted from each input news text, and then screened by setting simple semantic template rules such as "&lt;ORG&gt;+verb+&lt;ORG&gt;", "&lt;COM&gt;+verb+&lt;ORG&gt;", and "&lt;ORG&gt;+verb+&lt;COM&gt;"; simple sentences that satisfy the above rules, such as "China Construction Bank, abbreviated as Construction Bank, is located in Shanghai", are extracted as the data set to be normalized. The verbs inside the template rules between the company-name and organization-name fields are taken from a manually chosen dictionary of high-frequency verbs. After the above data are processed, the resulting data set has the property that each sentence contains only one kind of entity word, ORG or COM, and each sentence has only 2 to 4 entity words.
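The template screening step can be sketched with a regular expression over NER-tagged sentences. The tagged-string representation, the tiny verb list, and the function name are all assumptions of this sketch; the patent's verb dictionary is manually constructed and far larger.

```python
import re

# Hypothetical high-frequency verb dictionary for the domain (illustrative;
# the patent's real dictionary is manually constructed and much larger).
HIGH_FREQ_VERBS = ["简称", "收购", "入股"]

def matches_template(tagged_sentence):
    """Check the '<ORG>+verb+<ORG>' style semantic template rules.

    `tagged_sentence` is assumed to be a string in which an upstream NER
    step has wrapped entities, e.g.
    '<ORG>中国建设银行</ORG>简称<ORG>建设银行</ORG>'.
    """
    verbs = "|".join(map(re.escape, HIGH_FREQ_VERBS))
    # entity + verb + entity, where each entity is ORG or COM
    pattern = re.compile(
        r"<(ORG|COM)>[^<]+</\1>\s*(?:%s)\s*<(ORG|COM)>[^<]+</\2>" % verbs
    )
    return pattern.search(tagged_sentence) is not None

print(matches_template("<ORG>中国建设银行</ORG>简称<ORG>建设银行</ORG>"))
```

Sentences that fail the pattern are simply dropped from the data set to be normalized.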
S4: determine the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized.
As an optional embodiment of the present invention, determining the synonymous standard words and synonymous secondary words in the data to be normalized, thereby identifying the synonym pairs, includes: taking the entity word with more characters in a sentence as the synonymous standard word; if another entity word in the sentence belongs to the same class as the synonymous standard word, every character of that entity word is contained in the synonymous standard word, and that entity word has more than one character, then taking that entity word as the synonymous secondary word and determining that the synonymous standard word and the synonymous secondary word in the sentence form a synonym pair.
In specific implementation, after the previous step, the entity words in each sentence are extracted, and the entity word with more characters in the sentence is chosen as the synonymous standard word of that sentence. If another entity word belongs to the same class as the standard word, every character of it is contained in the standard word, and its character count is greater than 1, then that word is called the synonymous secondary word, and the two entity words in the sentence form a synonym pair. For example, in the short sentence "China Construction Bank, abbreviated as Construction Bank, is one of China's larger state-owned banks", the entity words "China Construction Bank" and "Construction Bank" both belong to the organization-name (ORG) class, and every character of the entity word "Construction Bank" is contained in the entity word "China Construction Bank", so the two words form a synonym pair, with the word having the most characters being the synonymous standard word. In a sentence, if a synonymous standard word and a synonymous secondary word both appear, the two have the same meaning semantically; the secondary word can express the semantic information of the name, and its inherent ambiguity can be eliminated.
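The three conditions above (same class, character containment, length greater than 1) can be checked directly. The input representation and function name are assumptions of this sketch; the patent assumes the entities arrive from the NER step.

```python
def find_synonym_pair(entities):
    """Pick the synonymous standard word and secondary word in one sentence.

    `entities` is a list of (word, type) tuples from NER, e.g.
    [("中国建设银行", "ORG"), ("建设银行", "ORG")]. Per the patent's rule:
    the word with the most characters is the standard word; another word of
    the same type whose every character appears in the standard word and
    whose length is greater than 1 is the secondary word.
    """
    if len(entities) < 2:
        return None
    standard = max(entities, key=lambda e: len(e[0]))
    for word, etype in entities:
        if word == standard[0]:
            continue
        same_class = etype == standard[1]
        contained = all(ch in standard[0] for ch in word)
        if same_class and contained and len(word) > 1:
            return standard[0], word  # (standard word, secondary word)
    return None

print(find_synonym_pair([("中国建设银行", "ORG"), ("建设银行", "ORG")]))
```

Note the containment test is per character, not substring: this matches abbreviations like 建行, whose characters are drawn non-contiguously from the full name.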
S5: set the data labeling strategy, label the data to be normalized, and add manually constructed data for data augmentation to obtain the training data.
As an optional embodiment of the present invention, setting the data labeling strategy, labeling the data to be normalized, and adding manually constructed data for data augmentation to obtain the training data includes: labeling the first character of the synonymous standard word in each sentence of the data to be normalized as E1, labeling the other characters of the synonymous standard word as I1, labeling the first character of the synonymous secondary word as E2, labeling the other characters of the synonymous secondary word as I2, and labeling the other characters in the sentence that are not in a synonym pair as O; and adding manually supplemented data satisfying the semantic template rules to the preliminary normalized data to obtain the training data, wherein the manually supplemented data are mixed with the data to be normalized by random shuffling.
In specific implementation, the data labeling strategy used in the embodiment of the present invention (see Fig. 3) is as follows: of the two words in a sentence that form a synonym pair, the word with the most characters is determined to be the synonymous standard word, its first character is labeled with the letter E1, and its other characters are labeled I1; the synonym with fewer characters is likewise treated as the secondary word, with its first character labeled E2 and its other characters labeled I2. All characters unrelated to the entity words are then labeled with the letter O, and any other shorter entity word similar to the synonymous standard word also has its first character labeled E2 and its other characters labeled I2. For example, in the sentence "China Construction Bank, abbreviated as Construction Bank or Jianhang, is one of China's large banks", "China Construction Bank" is the synonymous standard word of the sentence and is labeled "E1I1I1I1I1", "Jianhang" as a corresponding secondary word is labeled "E2I2", and "Construction Bank" is labeled in the same way. Finally, the other characters in the sentence are labeled "O". Manually supplemented data satisfying the above semantic template rules are added to the training set to achieve data augmentation; these manually supplemented data amount to about 20,000 texts, shuffled and mixed with the previously crawled and processed texts by random shuffling.
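The E1/I1/E2/I2/O character labeling above can be sketched as a small function. Only the first occurrence of each word is marked and overlapping occurrences are not handled; those simplifications, and the function name, are assumptions of this sketch.

```python
def label_sentence(sentence, standard, secondaries):
    """Apply the E1/I1/E2/I2/O character labeling strategy of Fig. 3.

    `standard` is the synonymous standard word, `secondaries` the shorter
    synonymous words; every other character is labeled O.
    """
    labels = ["O"] * len(sentence)

    def mark(word, first, rest):
        start = sentence.find(word)  # first occurrence only (sketch)
        if start < 0:
            return
        labels[start] = first
        for i in range(start + 1, start + len(word)):
            labels[i] = rest

    for sec in secondaries:       # mark the shorter words first ...
        mark(sec, "E2", "I2")
    mark(standard, "E1", "I1")    # ... so the standard word wins overlaps
    return labels

print(label_sentence("中国建设银行简称建行", "中国建设银行", ["建行"]))
```

The label sequence produced per character is exactly the target sequence the Encoder-Decoder model is later trained to emit.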
S6: pre-train character vectors and word vectors on the preliminary normalized data, and concatenate the character vectors and word vectors vertically to obtain new vectors.
As an optional embodiment of the present invention, pre-training character vectors and word vectors on the preliminary normalized data and concatenating them vertically to obtain new vectors includes: training character vectors with a Word2vec model of the Skip-gram structure, adding the entity words to the dictionary formed after word segmentation, training word vectors, and concatenating the word vectors with the character vectors vertically to obtain the new vectors.
In a specific implementation, the character vectors used in this embodiment are pre-trained with the context-window parameter set to 5, on sentences matching the constructed semantic template rules drawn from 5 million crawled sentences. This embodiment uses a Skip-gram Word2vec model, and the resulting character vectors have 500 dimensions. The text is also segmented, and the company-name (COM) or organization-name (ORG) entity words obtained in step S1 above are added to the post-segmentation dictionary before the word vectors are trained; the resulting word vectors are likewise 500-dimensional. To better extract the background information contained in the text, this embodiment vertically concatenates the word vectors with the character vectors to form new vectors, which are then used as pre-trained vectors for the embedding layer of the training model.
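As an illustration only, the vertical merge just described can be sketched as below. This is a toy version with 4-dimensional random vectors standing in for the 500-dimensional Skip-gram embeddings; all names are hypothetical, and the pairing of each character with the vector of the word containing it is an interpretation of the text, which does not spell out the alignment.

```python
import random

DIM = 4  # toy stand-in for the 500-dim vectors described in the text

# Hypothetical pretrained embeddings; in the patent these come from a
# Skip-gram Word2vec model (window=5) trained on ~5M crawled sentences.
char_vecs = {c: [random.random() for _ in range(DIM)] for c in "建设银行"}
word_vecs = {"建设银行": [random.random() for _ in range(DIM)]}

def merge(char, word):
    """Vertically stack the character vector and the vector of the
    word containing it, giving a 2 x DIM matrix (one row per source)."""
    return [char_vecs[char], word_vecs[word]]

m = merge("建", "建设银行")
assert len(m) == 2 and all(len(row) == DIM for row in m)
```

Whether "vertical direction" means stacking rows (as here) or concatenating into a single 1000-dimensional vector is not settled by the text; the sketch stacks rows.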
S7: pre-process the training data.
As an optional embodiment of the present invention, pre-processing the training data includes: separating the training data into an annotation sequence and a Chinese sequence, filtering stop words from the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary index.
In a specific implementation, text cleaning and data pre-processing are applied to the training data. The collected text data are gathered, special characters are removed with regular expressions, and a pre-built stop-word list is then introduced to remove modal auxiliary words that are meaningless for training. After Chinese word segmentation, the annotated sentences are processed into the input sequences required for training, denoted X1, X2, ..., Xn. The annotation field of each sentence is processed into the target text output sequence at time T0, denoted Y1, Y2, ..., Yn. An "<EOS>" identifier is then appended to the end of each target sequence to mark the end position of sequence prediction; the new target sequence obtained at this point is called the text output sequence at time T1, denoted Y1, Y2, ..., Yn, <EOS>.
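The pre-processing just described — separating the tag sequence from the text, indexing characters through a dictionary, and deriving the T0 and T1 targets by appending "<EOS>" — can be sketched as follows; the function and variable names are illustrative, not from the patent:

```python
def preprocess(pairs):
    """Separate each (sentence, tag-sequence) pair, encode the
    characters through a dictionary index, and build the T0 / T1
    targets: the T1 sequence is the T0 sequence with "<EOS>" appended."""
    vocab = {"<EOS>": 0}
    encoded = []
    for sent, tags in pairs:
        ids = [vocab.setdefault(ch, len(vocab)) for ch in sent]
        y_t0 = list(tags)               # Y1 ... Yn
        y_t1 = list(tags) + ["<EOS>"]   # Y1 ... Yn, <EOS>
        encoded.append((ids, y_t0, y_t1))
    return vocab, encoded

vocab, enc = preprocess([("建行", ["E2", "I2"])])
ids, y_t0, y_t1 = enc[0]
assert ids == [1, 2] and y_t1 == ["E2", "I2", "<EOS>"]
```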
S8: build a model with an Encoder-Decoder structure, train it on the pre-processed training data, and save the model with the best metrics.
As an optional embodiment of the present invention, building a model with an Encoder-Decoder structure, training it on the pre-processed training data and saving the best model includes: using a model of Encoder-Decoder structure; in the encoder, extracting sequence features with convolutional neural networks whose kernel sizes are 3, 4 and 5 respectively, each followed by a bidirectional recurrent neural network for serialization; adding self-attention to generate the corresponding attention weights, which are output as the intermediate state values of the Encoder; at the decoder end, forming the decoder from 2 layers of bidirectional recurrent neural networks; and feeding the target sequence of the previous moment into the decoder, which, acting together with the intermediate state layer, generates the target sequence of the next time step.
In a specific implementation, the model structure used in this embodiment is an Encoder-Decoder. A segment of training-data text sequence is input and processed by the encoder to generate an intermediate hidden layer C. The target text sequence at time T0 is then passed through an embedding layer to generate a matrix (Target | T=T0), which is added to C to obtain the matrix [C, Target | T=T0]. The decoder takes this matrix as input in order to predict the target data sequence that should be output at the next period T1. A convolutional neural network (CNN) extracts local text information well, but because of its limited local receptive field, it is not suited to extracting long-range entity relations in text; putting convolutional networks with different kernel sizes in parallel allows further extraction of text-sequence information. Meanwhile, BiLSTM-type models based on recurrent neural network (RNN) structures can handle cases where the distance between two entity words in a sentence is too long, or where a third entity word sits between the two. Therefore, in the Encoder structure of this embodiment (shown in Fig. 4), three parallel branches with different kernel sizes perform the convolution over the text, and a BiLSTM layer is placed after each CNN layer to alleviate the learning problems caused by long distances between entity words. Finally, a self-attention layer produces the hidden state C for each word in the text sequence. In the word-embedding stage, the character vectors are merged with the word vectors obtained by segmenting each sentence, and the merged embedding is used to improve the model's learning.
In the Decoder structure of the model (shown in Fig. 5), the hidden-layer tensor obtained from the Encoder is added to the target tensor at time T0 and fed into the model for training; the data obtained after softmax are the data at time T1. Note that the target data at times T0 and T1 are obtained in the data pre-processing stage. This embodiment uses the annotation characters of each sentence as the target data sequence, with the corresponding sentence as the training data sequence. An "<EOS>" character is spliced onto each line of target data to mark the end position of sequence prediction; when aligned with the original target data, this appended version corresponds to the data at time T1, while the original target data correspond to time T0. Finally, the Encoder and Decoder are joined together for training, which constitutes the model training stage provided by this embodiment.
The mathematical principles of the model and its detailed operating procedure are described below; the model can be trained through the following steps:
S81: extract sequence features with convolutional neural networks whose kernel sizes are 3, 4 and 5 respectively:
In this embodiment the transverse dimension of the character vectors is set to d, so the tensor obtained from the embedding layer has size x_i = (batch, d) and can be written as:
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n
where ⊕ denotes row-wise concatenation of the matrices and n is the number of input examples. Sequence information is extracted by setting different word-window sizes h; the feature matrix obtained from each word-window layer is computed as:
c_i = f(W_1 · x_{i:i+h-1} + b)
where f is the activation function of the convolutional network, i indexes the rows of the tensor x_i, b is the bias term of the network, and W_1 is the hyperparameter to be trained, initialized to 0. After the convolution operation, the sequence {X_{1:h}, X_{2:h+1}, ..., X_{n-h+1:n}} produces the corresponding feature set D, whose structure can be expressed as:
D = [c_1, c_2, ..., c_{n-h+1}], D ∈ R^{n-h+1}
On the resulting set D, a max-pooling operation selects D_j = max{D} as the feature vector obtained when the word-window size is h.
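Step S81 can be illustrated with a minimal pure-Python sketch of the window convolution c_i = f(W_1 · x_{i:i+h-1} + b) followed by max-pooling; ReLU is assumed as the activation f, which the text does not specify:

```python
def conv_feature(x, h, W, b):
    """c_i = f(W1 . x_{i:i+h-1} + b) over every window of h rows,
    then max-pooling D_j = max{D} over the feature set, as in S81.
    x is an n x d matrix (list of rows), W an h x d kernel."""
    relu = lambda v: v if v > 0 else 0.0
    n = len(x)
    D = []
    for i in range(n - h + 1):
        s = sum(W[r][c] * x[i + r][c]
                for r in range(h) for c in range(len(x[0])))
        D.append(relu(s + b))            # feature c_i for this window
    return max(D)                        # D_j = max{D}

x = [[1.0] * 4 for _ in range(6)]        # n=6 characters, d=4 dims
W = [[0.1] * 4 for _ in range(3)]        # word-window (kernel) size h=3
assert abs(conv_feature(x, 3, W, 0.0) - 1.2) < 1e-9
```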
S82: add a Dropout layer with the neuron ratio set to 50%:
To prevent overfitting during training, a Dropout layer is placed after the max-pooling layer. It randomly selects 50% of the neurons to stop updating their parameters while retaining their weights; the other 50% continue to be updated by gradient descent.
S83: add a BiLSTM layer whose number of neurons is set to the character-vector dimension d:
To extract the contextual properties of the tensors between network nodes and reduce the impact of long-range dependencies, this embodiment adds a Bidirectional LSTM layer so that the tensor retains good sequence features after convolution. The feature matrix D_j filtered by the Dropout layer is fed into this sub-network, whose update formulas are as follows:
Forward update: h_{t1} = f1(w_{21} D_j + v_{21} h_{t-1} + b_1)
Backward update: h_{t2} = f1(w_{22} D_j + v_{22} h_{t+1} + b_2)
Hidden-layer output: G = g(U [h_{t1}; h_{t2}] + c)
where h_{t1} is the hidden state generated by the forward LSTM pass and h_{t2} the hidden state generated by the reverse (backward) pass; w and v are training parameters and b the corresponding bias terms, initialized to 0. The feature sequence finally obtained is the tensor G. In this embodiment, text information is extracted by three identical parallel branches, so the resulting feature sequences are denoted G1, G2 and G3.
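A drastically simplified scalar sketch of the forward/backward recurrences above is given below. It replaces the full LSTM gating with a single tanh update, so it illustrates only the bidirectional structure of S83, not an actual LSTM; the parameter names mirror the formulas but the implementation is an assumption.

```python
import math

def bi_rnn(seq, w, v, b):
    """Scalar toy version of the S83 update rules:
    forward  h_t  = tanh(w*x_t + v*h_{t-1} + b)
    backward h'_t = tanh(w*x_t + v*h'_{t+1} + b)
    The output pairs both directions per position, like [h_t1; h_t2]."""
    fwd, h = [], 0.0
    for x in seq:                         # left-to-right pass
        h = math.tanh(w * x + v * h + b)
        fwd.append(h)
    bwd, h = [0.0] * len(seq), 0.0
    for i in range(len(seq) - 1, -1, -1):  # right-to-left pass
        h = math.tanh(w * seq[i] + v * h + b)
        bwd[i] = h
    return [(f, bk) for f, bk in zip(fwd, bwd)]

G = bi_rnn([1.0, 0.5, -0.5], w=0.5, v=0.1, b=0.0)
assert len(G) == 3 and all(len(t) == 2 for t in G)
```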
S84: introduce a Concatenate layer to merge the tensors obtained from the three branches:
After the preceding processing, the sequence tensors obtained from the branches are G1, G2 and G3. They are spliced along their last dimension to obtain a new sequence tensor Gn; each row of Gn represents the vector of one word.
S85: introduce the self-attention mechanism to generate the corresponding hidden state layer C:
Here the sequence tensor x produced by the embedding layer is regarded as the source sequence, written (X1, X2, ..., Xn), with size (batch, d). The sequence Gn is regarded as being composed of <key, value> pairs, while the sequence tensor X obtained by the embedding operation serves as the Query for computing the attention weights. Since Gn is derived from Xi, the attention mechanism is placed inside the Encoder structure for training. Query, Key and Value are first defined as follows:
Query = W_Q X
Key = W_K Gn
Value = W_V Gn
where W_Q, W_K and W_V are parameters to be trained. The similarity between the Query and the i-th keyword Key_i in the sequence is then computed by cosine similarity:
Sim_i = (Query × Key_i) / (||Query|| × ||Key_i||)
The results are normalized with the softmax method to obtain the attention weight a_i of each key, where i indexes the key values and Key_i denotes the word vector built from each row of Gn:
a_i = exp(Sim_i) / Σ_j exp(Sim_j)
Finally, the hidden state parameter C_j in the Encoder is computed as follows (the state parameter generated by each keyword Key_i may differ; a_i denotes the corresponding attention weight and L the sequence length):
C_j = Σ_{i=1}^{L} a_i × Value_i
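The cosine-similarity attention of S85 can be illustrated as follows — a toy-sized sketch computing Sim_i by cosine similarity, normalizing with softmax, and taking the attention-weighted sum over values:

```python
import math

def attention(query, keys, values):
    """Sim_i = cos(Query, Key_i); a_i = softmax(Sim);
    C = sum_i a_i * Value_i  (toy version of the S85 formulas)."""
    cos = lambda q, k: sum(a * b for a, b in zip(q, k)) / (
        math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in k)))
    sims = [cos(query, k) for k in keys]
    exps = [math.exp(s) for s in sims]
    a = [e / sum(exps) for e in exps]          # attention weights a_i
    C = [sum(ai * vi[d] for ai, vi in zip(a, values))
         for d in range(len(values[0]))]       # weighted sum of values
    return a, C

a, C = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                 [[1.0, 2.0], [3.0, 4.0]])
assert abs(sum(a) - 1.0) < 1e-9 and len(C) == 2
```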
S86: construct two BiLSTM layers in series as the Decoder end and complete the model training stage:
After the hidden state layer C_j generated by the Encoder is obtained, the target sequence data (Y1, Y2, ..., Yn) at time T0, obtained from data pre-processing, are passed through an Embedding operation to produce the matrix Y. C_j and Y are taken as the Decoder input, and after the softmax the Decoder outputs the target sequence data at time T1, (Y1, Y2, ..., Yn, <EOS>). The calculation formulas are:
Y|_{T=T1} = f1(C_j, Y|_{T=T0})
T1 = T0 + 1
where f1 denotes the nonlinear transformation function at the decoder end and T1 is the moment following T0. During training, cross-entropy is chosen as the loss value for the gradient-iteration process. Finally, a suitable number of epochs is chosen from the loss curve and, after parameter tuning, the model with the best metrics is saved.
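The cross-entropy loss mentioned above, used as the gradient-iteration criterion, can be written out for a single decoding step. This is a sketch; the real model would compute it over batches of tag sequences, and the score/target names are illustrative:

```python
import math

def cross_entropy(scores, target_idx):
    """Per-step loss: -log p(target) after a softmax over the
    decoder's raw output scores for one time step."""
    exps = [math.exp(s) for s in scores]
    probs = [e / sum(exps) for e in exps]
    return -math.log(probs[target_idx])

# Uniform scores over 3 candidate tags give a loss of ln(3).
assert abs(cross_entropy([0.0, 0.0, 0.0], 1) - math.log(3)) < 1e-9
```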
S9: predict the samples to be predicted with the best model; using a Beamsearch strategy, choose the highest-probability sequence as the output sequence and obtain the synonym sequence of the samples to be predicted.
As an optional embodiment of the present invention, predicting the samples to be predicted with the best model includes: using a Beamsearch strategy with the Beamsearch size set to 3, choosing the highest-probability sequence as the output sequence, and obtaining the synonym sequence of the samples to be predicted.
In a specific implementation, the data in training format are pre-processed in the prediction stage of the model: a "<GO>" character is spliced to the front of every line to mark the starting position of the prediction sequence. The prediction stage shares its weights with the training stage, so the resulting text sequence can be run through prediction after being embedded into a tensor. In a sample to be predicted, the words formed by characters labeled "E1", "I1", "E2" or "I2" in the corresponding sentence constitute the Chinese synonyms.
In the prediction stage, target data are absent. The sequence data set to be predicted is therefore processed into encoded form, and an <EOS> end-of-sequence symbol is appended to the end of each sequence. With the Beamsearch size set to 3, each sequence is fed into the model, the output of the previous time step serving as the input of the next. Each time step outputs a probability matrix over the different text sequences, and prediction stops once the algorithm reaches the <EOS> mark in the sequence. Finally, the text sequence with the largest probability among all outputs is selected as the predicted sequence.
Suppose a corpus contains only the two words A and B. The process is then as follows: the first time step outputs the probabilities P(A) and P(B) of generating the two words; then, with [A, B]^T as the input of the next time step, the model outputs the sequence probability matrix P(AB|A), P(AA|A), P(AB|B), ..., P(BB|B) at this moment, and so on for subsequent time steps. Whenever sequences of 3 words have been generated, only the highest-probability sequence among the top-ranked 3-word candidates is retained. Applying all the above strategies finally produces the synonymous annotation sequence of the entity words. After this prediction, the synonym pairs of enterprise or organization entity words within a complete sentence are obtained, which avoids the semantic-understanding problems caused by different synonymous aliases when users search, and effectively reduces the ambiguity of name-class entity words. While users operate the system, synonyms are also collected dynamically in the background; if a user enters an abbreviation of an enterprise or organization, the system can quickly locate the search target, which also greatly improves retrieval efficiency.
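The beam-search decoding described for the two-word corpus {A, B} can be sketched as below, with beam size 3 as in the text; `step_probs` is a hypothetical stand-in for the decoder's per-step probability output:

```python
def beam_search(step_probs, beam=3):
    """Keep the `beam` highest-probability partial sequences after each
    time step; `step_probs(prefix)` returns {token: P(token | prefix)}.
    Decoding stops when every kept sequence ends in <EOS>."""
    beams = [((), 1.0)]
    while not all(seq and seq[-1] == "<EOS>" for seq, _ in beams):
        cand = []
        for seq, p in beams:
            if seq and seq[-1] == "<EOS>":
                cand.append((seq, p))      # finished sequences carry over
                continue
            for tok, q in step_probs(seq).items():
                cand.append((seq + (tok,), p * q))
        beams = sorted(cand, key=lambda x: -x[1])[:beam]
    return max(beams, key=lambda x: x[1])[0]

# Toy two-word corpus {A, B} as in the example above.
def probs(prefix):
    if len(prefix) < 2:
        return {"A": 0.6, "B": 0.4}
    return {"<EOS>": 1.0}

assert beam_search(probs) == ("A", "A", "<EOS>")
```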
It can be seen that the sequence-recognition-based enterprise or organization Chinese entity disambiguation method provided by the embodiments of the present invention makes it possible to apply Chinese entity disambiguation in enterprise public-opinion analysis. The method provided by the embodiments has the following advantages:
1. By constructing simple semantic templates and a manual data annotation scheme, news corpora are crawled to build an effective data set, which serves as the basis for model training.
2. Pre-trained character and word vectors are built from 5 million self-crawled news sentences, with the vector dimension controlled at 500.
3. A model structure is constructed that can extract the contextual information of text sequences well from the language model.
Chinese entity disambiguation suffers from a lack of data sets, and existing methods cannot quantify features and require case-by-case analysis. This method first proposes a scheme for constructing an annotated data set, converting chaotic raw text data into training data for supervised learning. Second, the method proposes a new model structure for text-sequence processing, which has the following advantages over conventional methods:
1. It abandons the conventional approach of converting Chinese entity disambiguation into a classification task; such methods build large knowledge bases for rule matching to find synonyms, which is too costly and inconvenient.
2. Compared with traditional statistical machine-learning methods such as Hidden Markov models, and with text vectors generated from word frequencies, this method extracts text-sequence features better and improves the model's suitability for the Chinese entity-disambiguation scenario.
3. Compared with conventional methods, this method copes better with learning over long-distance text sequences.
Fig. 6 shows a structural schematic diagram of the sequence-recognition-based enterprise or organization Chinese entity disambiguation device provided by an embodiment of the present invention. The device applies the sequence-recognition-based enterprise or organization Chinese entity disambiguation method described above; only the structure of the device is briefly described below, and for all other matters please refer to the related description of the method, which is not repeated here. Referring to Fig. 6, the sequence-recognition-based enterprise or organization Chinese entity disambiguation device provided by an embodiment of the present invention comprises:
a data set construction module 601, configured to crawl a public news data set and perform data cleaning on it to obtain cleaned data; extract the entity words in the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of: company names COM and organization names ORG; set semantic template rules and screen the preliminary normalized data to obtain data to be normalized; and determine the synonymous standard words and synonymous aliases in the data to be normalized, clarifying the synonym pairs in the data to be normalized;
a data annotation module 602, configured to set a data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation, to obtain training data;
a vector training module 603, configured to pre-train character vectors and word vectors with the preliminary normalized data and merge the character vectors and word vectors in the vertical direction to obtain new vectors;
a pre-processing module 604, configured to pre-process the training data;
a model training module 605, configured to build a model with an Encoder-Decoder structure, train it on the pre-processed training data, and save the model with the best metrics;
a prediction module 606, configured to predict samples to be predicted with the best model, choose the highest-probability sequence as the output sequence using a Beamsearch strategy, and obtain the synonym sequence of the samples to be predicted.
It can be seen that the sequence-recognition-based enterprise or organization Chinese entity disambiguation device provided by the embodiments of the present invention makes it possible to apply Chinese entity disambiguation in enterprise public-opinion analysis.
As an optional embodiment of the present invention, the data set construction module 601 crawls the public news data set and performs data cleaning on it in the following way: it crawls public domestic, economic and technology news data, removes special characters and meaningless symbols, checks for null values, and removes any record containing a null value, obtaining the cleaned data.
As an optional embodiment of the present invention, the data set construction module 601 extracts the entity words in the cleaned data in the following way: it processes the cleaned data with a pre-trained Chinese named-entity recognition model, extracts the company-name and organization-name entity words in each sentence as supplementary training corpus, splits long and short sentences, and controls the number of characters of each sentence within a preset limit.
As an optional embodiment of the present invention, the data set construction module 601 sets the semantic template rules and screens the preliminary normalized data in the following way: it performs data screening with the template rules "<ORG>+verb+<ORG> / <COM>+verb+<ORG> / <ORG>+verb+<COM> / <COM>+verb+<COM>" together with a manually constructed dictionary of high-frequency verbs in the field, obtaining the data to be normalized.
As an optional embodiment of the present invention, the data set construction module 601 determines the synonymous standard words and synonymous aliases in the data to be normalized in the following way: it takes the entity word with more characters in a sentence as the synonymous standard word; if another entity word in the sentence belongs to the same class as the standard word, every character of that entity word is contained in the standard word, and its number of characters is greater than 1, then that entity word is taken as the synonymous alias, and the standard word and the alias in the sentence are determined to form a synonym pair.
As an optional embodiment of the present invention, the data annotation module 602 sets the data annotation strategy, annotates the data to be normalized, and adds manually constructed data for data augmentation in the following way: in each sentence of the data to be normalized, the first character of the synonymous standard word is labeled E1 and the other characters of the standard word are labeled I1; the first character of the synonymous alias is labeled E2 and its other characters I2; all other characters in the sentence outside the synonym pair are labeled O. Manually supplemented data satisfying the semantic template rules are added to the preliminary normalized data to obtain the training data, the supplementary data being randomly shuffled and mixed with the data to be normalized.
As an optional embodiment of the present invention, the vector training module 603 pre-trains the character vectors and word vectors with the preliminary normalized data in the following way: it trains character vectors with a Skip-gram Word2vec model, adds the entity words to the dictionary formed after segmentation, trains word vectors, and vertically concatenates the word vectors with the character vectors to obtain the new vectors.
As an optional embodiment of the present invention, the pre-processing module 604 pre-processes the training data in the following way: it separates the training data into an annotation sequence and a Chinese sequence, filters stop words from the Chinese sequence, builds a dictionary, and encodes the text sequence according to the dictionary index.
As an optional embodiment of the present invention, the model training module 605 builds and trains the model with the Encoder-Decoder structure in the following way: it uses a model of Encoder-Decoder structure; in the encoder, it extracts sequence features with convolutional neural networks whose kernel sizes are 3, 4 and 5 respectively, each followed by a bidirectional recurrent neural network for serialization, and adds self-attention to generate the corresponding attention weights, output as the intermediate state values of the Encoder; at the decoder end, it forms the decoder from 2 layers of bidirectional recurrent neural networks, feeding the target sequence of the previous moment into the decoder, which acts together with the intermediate state layer to generate the target sequence of the next time step.
As an optional embodiment of the present invention, the prediction module 606 predicts the samples to be predicted with the best model in the following way: using a Beamsearch strategy with the Beamsearch size set to 3, it chooses the highest-probability sequence as the output sequence and obtains the synonym sequence of the samples to be predicted.
Those skilled in the art will appreciate that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce a manufactured article including a command device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include non-volatile memory in a computer-readable medium, random-access memory (RAM) and/or other forms of non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can realize information storage by any method or technology. Information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other kinds of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact-disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The above are only embodiments of the present application and are not intended to limit it. Various modifications and changes are possible for those skilled in the art; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall be included within the scope of its claims.
Claims (10)
1. A sequence-recognition-based enterprise or organization Chinese entity disambiguation method, characterized by comprising:
crawling a public news data set and performing data cleaning on the news data set to obtain cleaned data;
extracting the entity words in the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of: company names COM and organization names ORG;
setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized;
determining the synonymous standard words and synonymous aliases in the data to be normalized, clarifying the synonym pairs in the data to be normalized;
setting a data annotation strategy, annotating the data to be normalized, and adding manually constructed data for data augmentation, to obtain training data;
pre-training character vectors and word vectors with the preliminary normalized data, and merging the character vectors and the word vectors in the vertical direction to obtain new vectors;
pre-processing the training data;
building a model with an Encoder-Decoder structure, training it on the pre-processed training data, and saving the model with the best metrics;
predicting samples to be predicted with the best model, using a Beamsearch strategy to choose the highest-probability sequence as the output sequence, and obtaining the synonym sequence of the samples to be predicted.
2. The method according to claim 1, wherein crawling the public news data set and performing data cleaning on the news data set to obtain the cleaned data comprises:
crawling public national, economic, and science-and-technology news data, removing special characters and meaningless symbols, and checking for null values; if an item contains a null value, removing that item, to obtain the cleaned data;
extracting the entity words from the cleaned data to obtain the preliminarily normalized data comprises:
processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and splitting long sentences and merging short ones so that each sentence is kept within a preset character count;
setting the semantic template rules and screening the preliminarily normalized data to obtain the data to be normalized comprises:
screening the data with the pattern rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" and a manually constructed in-domain high-frequency verb dictionary, to obtain the data to be normalized;
and/or
determining the synonym standard word and the synonym secondary word in the data to be normalized, thereby making explicit the synonym pairs in the data to be normalized, comprises:
taking the entity word with more characters in a sentence of the data to be normalized as the synonym standard word; if the other entity word in the sentence belongs to the same class as the synonym standard word, every character of that other entity word is contained in the synonym standard word, and that other entity word has more than one character, then taking that other entity word as the synonym secondary word and determining that the synonym standard word and the synonym secondary word in the sentence form a synonym pair.
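The template screening and synonym-pair rule of claim 2 can be sketched as below. This is a simplification under stated assumptions: entities are assumed pre-tagged as 'ORG'/'COM', and the hand-built high-frequency verb dictionary is reduced to a single 'V' tag:

```python
def match_template(tags):
    """Check the '<ORG|COM> + verb + <ORG|COM>' pattern over a tagged
    sentence; tags is a list like ['ORG', 'V', 'COM']."""
    ent = {"ORG", "COM"}
    return any(
        tags[i] in ent and tags[i + 1] == "V" and tags[i + 2] in ent
        for i in range(len(tags) - 2)
    )

def synonym_pair(entity_a, entity_b):
    """Apply the claim-2 rule to two same-class entities: the longer entity
    is the standard word; the shorter one counts as its synonym secondary
    word only if every one of its characters occurs in the standard word
    and it is longer than one character."""
    std, syn = (entity_a, entity_b) if len(entity_a) >= len(entity_b) else (entity_b, entity_a)
    if len(syn) > 1 and all(ch in std for ch in syn):
        return std, syn
    return None
```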
3. The method according to claim 1, wherein
setting the data annotation strategy, annotating the data to be normalized, and adding manually constructed data for data augmentation to obtain the training data comprises:
labeling the first character of the synonym standard word in each sentence of the data to be normalized as S1 and its other characters as I1, labeling the first character of the synonym secondary word as E2 and its other characters as I2, and labeling all other characters in the sentence that are not part of a synonym pair as O;
adding manually supplemented data that conform to the semantic template rules to the preliminarily normalized data to obtain the training data, wherein the manually supplemented data are mixed into the data to be normalized by random shuffling;
pre-training the character vectors and word vectors on the preliminarily normalized data and concatenating the character vectors and word vectors in the vertical direction to obtain the new vectors comprises:
training the character vectors with a skip-gram Word2vec model, adding the entity words to the dictionary formed after word segmentation and training the word vectors, and concatenating the word vectors and character vectors in the vertical direction to obtain the new vectors;
and/or
preprocessing the training data comprises:
separating the training data into a label sequence and a Chinese sequence, filtering stop words from the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary index.
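The character-level annotation strategy of claim 3 might be implemented as below. The label names are assumptions: the machine translation renders the standard-word labels as "SEi"/"Ii", taken here to be "S1"/"I1" by symmetry with "E2"/"I2":

```python
def tag_sentence(sentence, standard, synonym):
    """Character-level tagging per claim 3 (label names assumed):
    first char of the standard word -> S1, its other chars -> I1,
    first char of the synonym secondary word -> E2, its other chars -> I2,
    everything not in the synonym pair -> O."""
    tags = ["O"] * len(sentence)
    s = sentence.find(standard)
    if s >= 0:
        tags[s] = "S1"
        for k in range(s + 1, s + len(standard)):
            tags[k] = "I1"
    # find a synonym occurrence that does not overlap the standard word span
    p = sentence.find(synonym)
    while p >= 0 and s >= 0 and p < s + len(standard) and p + len(synonym) > s:
        p = sentence.find(synonym, p + 1)
    if p >= 0:
        tags[p] = "E2"
        for k in range(p + 1, p + len(synonym)):
            tags[k] = "I2"
    return tags
```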
4. The method according to claim 1, wherein building the model with the Encoder-Decoder structure, training it on the preprocessed training data, and saving the optimal model comprises:
using a model with an Encoder-Decoder structure, in which, at the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 respectively extract sequence features, each is serialized by a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, which are output by the Encoder as the intermediate state values; at the decoder, the decoder is composed of a 2-layer bidirectional recurrent neural network, which takes the target sequence of the previous time step as input, interacts with the intermediate state layer, and generates the target sequence of the next time step.
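The full encoder of claim 4 (parallel CNNs with kernel sizes 3, 4 and 5, a bidirectional RNN, then self-attention) is too large to reproduce here, but the self-attention weighting that produces the intermediate state values can be sketched in pure Python. Dot-product scoring is an assumption; the claim does not fix the scoring function:

```python
import math

def self_attention(states):
    """Dot-product self-attention over encoder hidden states (lists of
    floats). Returns one attention-weighted context vector per position."""
    def softmax(xs):
        m = max(xs)
        es = [math.exp(x - m) for x in xs]
        s = sum(es)
        return [e / s for e in es]

    contexts = []
    for q in states:
        # attention weights of this position over every encoder state
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) for k in states])
        # weighted sum of the states = intermediate state value for q
        ctx = [sum(w * k[d] for w, k in zip(scores, states)) for d in range(len(q))]
        contexts.append(ctx)
    return contexts
```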
5. The method according to claim 4, wherein predicting on the sample to be predicted with the optimal model comprises:
using a Beam Search strategy with a beam size of 3, and choosing the sequence with the highest probability as the output sequence, to obtain the synonym word sequence of the sample to be predicted.
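Beam search with a beam width of 3, as fixed in claim 5, keeps the three highest-scoring partial sequences at each decoding step. A toy sketch follows; the per-step distributions are a stand-in for the decoder's softmax output, which in the real model depends on the generated prefix:

```python
import math

def beam_search(step_probs, beam_size=3):
    """Return the highest-probability sequence under a beam of width 3.
    step_probs is a list of per-step dicts mapping token -> probability."""
    beams = [([], 0.0)]  # (sequence so far, cumulative log-probability)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # keep only the beam_size best partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]
```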
6. A sequence-recognition-based Chinese entity disambiguation device for enterprise or organization names, characterized by comprising:
a data set construction module, configured to crawl a public news data set and perform data cleaning on the news data set to obtain cleaned data; extract the entity words from the cleaned data to obtain preliminarily normalized data, wherein the entity words include at least one of: company names COM and organization names ORG; set semantic template rules and screen the preliminarily normalized data to obtain data to be normalized; and determine the synonym standard word and the synonym secondary word in the data to be normalized, thereby making explicit the synonym pairs in the data to be normalized;
a data annotation module, configured to set a data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation to obtain training data;
a vector training module, configured to pre-train character vectors and word vectors on the preliminarily normalized data and concatenate the character vectors and word vectors in the vertical direction to obtain new vectors;
a preprocessing module, configured to preprocess the training data;
a model training module, configured to build a model with an Encoder-Decoder structure, train it on the preprocessed training data, and save the model with the best evaluation metric (the optimal model);
a prediction module, configured to predict on a sample to be predicted with the optimal model, using a Beam Search strategy to choose the sequence with the highest probability as the output sequence, thereby obtaining the synonym word sequence of the sample to be predicted.
7. The device according to claim 6, wherein
the data set construction module crawls the public news data set and performs data cleaning on the news data set to obtain the cleaned data in the following way:
the data set construction module is specifically configured to crawl public national, economic, and science-and-technology news data, remove special characters and meaningless symbols, and check for null values; if an item contains a null value, remove that item, to obtain the cleaned data;
the data set construction module extracts the entity words from the cleaned data to obtain the preliminarily normalized data in the following way:
the data set construction module is specifically configured to process the cleaned data with a pre-trained Chinese named entity recognition model, extract the company-name and organization-name entity words in each sentence as supplementary training corpus, and split long sentences and merge short ones so that each sentence is kept within a preset character count;
the data set construction module sets the semantic template rules and screens the preliminarily normalized data to obtain the data to be normalized in the following way:
the data set construction module is specifically configured to screen the data with the pattern rules "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" and a manually constructed in-domain high-frequency verb dictionary, to obtain the data to be normalized;
and/or
the data set construction module determines the synonym standard word and the synonym secondary word in the data to be normalized, thereby making explicit the synonym pairs in the data to be normalized, in the following way:
the data set construction module is specifically configured to take the entity word with more characters in a sentence of the data to be normalized as the synonym standard word; if the other entity word in the sentence belongs to the same class as the synonym standard word, every character of that other entity word is contained in the synonym standard word, and that other entity word has more than one character, then take that other entity word as the synonym secondary word and determine that the synonym standard word and the synonym secondary word in the sentence form a synonym pair.
8. The device according to claim 6, wherein
the data annotation module sets the data annotation strategy, annotates the data to be normalized, and adds manually constructed data for data augmentation to obtain the training data in the following way:
the data annotation module is specifically configured to label the first character of the synonym standard word in each sentence of the data to be normalized as S1 and its other characters as I1, label the first character of the synonym secondary word as E2 and its other characters as I2, and label all other characters in the sentence that are not part of a synonym pair as O; and to add manually supplemented data that conform to the semantic template rules to the preliminarily normalized data to obtain the training data, wherein the manually supplemented data are mixed into the data to be normalized by random shuffling;
the vector training module pre-trains the character vectors and word vectors on the preliminarily normalized data and concatenates the character vectors and word vectors in the vertical direction to obtain the new vectors in the following way:
the vector training module is specifically configured to train the character vectors with a skip-gram Word2vec model, add the entity words to the dictionary formed after word segmentation and train the word vectors, and concatenate the word vectors and character vectors in the vertical direction to obtain the new vectors;
and/or
the preprocessing module preprocesses the training data in the following way:
the preprocessing module is specifically configured to separate the training data into a label sequence and a Chinese sequence, filter stop words from the Chinese sequence, build a dictionary, and encode the text sequence according to the dictionary index.
9. The device according to claim 6, wherein the model training module builds the model with the Encoder-Decoder structure, trains it on the preprocessed training data, and saves the optimal model in the following way:
the model training module is specifically configured to use a model with an Encoder-Decoder structure, in which, at the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 respectively extract sequence features, each is serialized by a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, which are output by the Encoder as the intermediate state values; at the decoder, the decoder is composed of a 2-layer bidirectional recurrent neural network, which takes the target sequence of the previous time step as input, interacts with the intermediate state layer, and generates the target sequence of the next time step.
10. The device according to claim 9, wherein the prediction module predicts on the sample to be predicted with the optimal model in the following way:
the prediction module is specifically configured to use a Beam Search strategy with a beam size of 3 and choose the sequence with the highest probability as the output sequence, to obtain the synonym word sequence of the sample to be predicted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910297022.1A CN110020438B (en) | 2019-04-15 | 2019-04-15 | Sequence identification based enterprise or organization Chinese name entity disambiguation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020438A true CN110020438A (en) | 2019-07-16 |
CN110020438B CN110020438B (en) | 2020-12-08 |
Family
ID=67191295
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910297022.1A Active CN110020438B (en) | 2019-04-15 | 2019-04-15 | Sequence identification based enterprise or organization Chinese name entity disambiguation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020438B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516233A (en) * | 2019-08-06 | 2019-11-29 | 深圳和而泰家居在线网络科技有限公司 | Method, apparatus, terminal device and the storage medium of data processing |
CN111079435A (en) * | 2019-12-09 | 2020-04-28 | 深圳追一科技有限公司 | Named entity disambiguation method, device, equipment and storage medium |
CN111079418A (en) * | 2019-11-06 | 2020-04-28 | 科大讯飞股份有限公司 | Named body recognition method and device, electronic equipment and storage medium |
CN111259087A (en) * | 2020-01-10 | 2020-06-09 | 中国科学院软件研究所 | Computer network protocol entity linking method and system based on domain knowledge base |
CN111339319A (en) * | 2020-03-02 | 2020-06-26 | 北京百度网讯科技有限公司 | Disambiguation method and device for enterprise name, electronic equipment and storage medium |
CN111581335A (en) * | 2020-05-14 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111737407A (en) * | 2020-08-25 | 2020-10-02 | 成都数联铭品科技有限公司 | Event unique ID construction method based on event disambiguation |
CN111814479A (en) * | 2020-07-09 | 2020-10-23 | 上海明略人工智能(集团)有限公司 | Enterprise short form generation and model training method and device |
CN112017643A (en) * | 2020-08-24 | 2020-12-01 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and related device |
CN112069826A (en) * | 2020-07-15 | 2020-12-11 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
WO2021139257A1 (en) * | 2020-06-24 | 2021-07-15 | 平安科技(深圳)有限公司 | Method and apparatus for selecting annotated data, and computer device and storage medium |
CN113326380A (en) * | 2021-08-03 | 2021-08-31 | 国能大渡河大数据服务有限公司 | Equipment measurement data processing method, system and terminal based on deep neural network |
CN113609825A (en) * | 2021-10-11 | 2021-11-05 | 北京百炼智能科技有限公司 | Intelligent customer attribute tag identification method and device |
CN113761942A (en) * | 2021-09-14 | 2021-12-07 | 合众新能源汽车有限公司 | Semantic analysis method and device based on deep learning model and storage medium |
US20220083733A1 (en) * | 2019-12-05 | 2022-03-17 | Boe Technology Group Co., Ltd. | Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium |
CN114398492A (en) * | 2021-12-24 | 2022-04-26 | 森纵艾数(北京)科技有限公司 | Knowledge graph construction method, terminal and medium in digital field |
US20220171924A1 (en) * | 2019-03-25 | 2022-06-02 | Nippon Telegraph And Telephone Corporation | Index value giving apparatus, index value giving method and program |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008106473A1 (en) * | 2007-02-26 | 2008-09-04 | Microsoft Corporation | Automatic disambiguation based on a reference resource |
US8856119B2 (en) * | 2009-02-27 | 2014-10-07 | International Business Machines Corporation | Holistic disambiguation for entity name spotting |
CN104268200A (en) * | 2013-09-22 | 2015-01-07 | 中科嘉速(北京)并行软件有限公司 | Unsupervised named entity semantic disambiguation method based on deep learning |
CN104111973A (en) * | 2014-06-17 | 2014-10-22 | 中国科学院计算技术研究所 | Scholar name duplication disambiguation method and system |
CN106407180A (en) * | 2016-08-30 | 2017-02-15 | 北京奇艺世纪科技有限公司 | Entity disambiguation method and apparatus |
CN108572960A (en) * | 2017-03-08 | 2018-09-25 | 富士通株式会社 | Place name disappears qi method and place name disappears qi device |
CN107391613A (en) * | 2017-07-04 | 2017-11-24 | 北京航空航天大学 | A kind of automatic disambiguation method of more documents of industry security theme and device |
CN107784125A (en) * | 2017-11-24 | 2018-03-09 | 中国银行股份有限公司 | A kind of entity relation extraction method and device |
Non-Patent Citations (2)
Title |
---|
ANGEL L. GARRIDO ET AL.: "NEREA: Named Entity Recognition and Disambiguation Exploiting Local Document Repositories", 2016 IEEE 28th International Conference on Tools with Artificial Intelligence (ICTAI) * |
GE BIN ET AL.: "Template-Based Unsupervised Word Sense Disambiguation Method", Computer Engineering & Science (《计算机工程与科学》) * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220171924A1 (en) * | 2019-03-25 | 2022-06-02 | Nippon Telegraph And Telephone Corporation | Index value giving apparatus, index value giving method and program |
US11960836B2 (en) * | 2019-03-25 | 2024-04-16 | Nippon Telegraph And Telephone Corporation | Index value giving apparatus, index value giving method and program |
CN110516233B (en) * | 2019-08-06 | 2023-08-01 | 深圳数联天下智能科技有限公司 | Data processing method, device, terminal equipment and storage medium |
CN110516233A (en) * | 2019-08-06 | 2019-11-29 | 深圳和而泰家居在线网络科技有限公司 | Method, apparatus, terminal device and the storage medium of data processing |
CN111079418A (en) * | 2019-11-06 | 2020-04-28 | 科大讯飞股份有限公司 | Named body recognition method and device, electronic equipment and storage medium |
CN111079418B (en) * | 2019-11-06 | 2023-12-05 | 科大讯飞股份有限公司 | Named entity recognition method, device, electronic equipment and storage medium |
US20220083733A1 (en) * | 2019-12-05 | 2022-03-17 | Boe Technology Group Co., Ltd. | Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium |
US11977838B2 (en) * | 2019-12-05 | 2024-05-07 | Boe Technology Group Co., Ltd. | Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium |
CN111079435A (en) * | 2019-12-09 | 2020-04-28 | 深圳追一科技有限公司 | Named entity disambiguation method, device, equipment and storage medium |
CN111259087A (en) * | 2020-01-10 | 2020-06-09 | 中国科学院软件研究所 | Computer network protocol entity linking method and system based on domain knowledge base |
CN111259087B (en) * | 2020-01-10 | 2022-10-14 | 中国科学院软件研究所 | Computer network protocol entity linking method and system based on domain knowledge base |
CN111339319A (en) * | 2020-03-02 | 2020-06-26 | 北京百度网讯科技有限公司 | Disambiguation method and device for enterprise name, electronic equipment and storage medium |
CN111339319B (en) * | 2020-03-02 | 2023-08-04 | 北京百度网讯科技有限公司 | Enterprise name disambiguation method and device, electronic equipment and storage medium |
CN111581335B (en) * | 2020-05-14 | 2023-11-24 | 腾讯科技(深圳)有限公司 | Text representation method and device |
CN111581335A (en) * | 2020-05-14 | 2020-08-25 | 腾讯科技(深圳)有限公司 | Text representation method and device |
WO2021139257A1 (en) * | 2020-06-24 | 2021-07-15 | 平安科技(深圳)有限公司 | Method and apparatus for selecting annotated data, and computer device and storage medium |
CN111814479A (en) * | 2020-07-09 | 2020-10-23 | 上海明略人工智能(集团)有限公司 | Enterprise short form generation and model training method and device |
CN111814479B (en) * | 2020-07-09 | 2023-08-25 | 上海明略人工智能(集团)有限公司 | Method and device for generating enterprise abbreviations and training model thereof |
CN112069826B (en) * | 2020-07-15 | 2021-12-07 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
CN112069826A (en) * | 2020-07-15 | 2020-12-11 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
CN112017643B (en) * | 2020-08-24 | 2023-10-31 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and related device |
CN112017643A (en) * | 2020-08-24 | 2020-12-01 | 广州市百果园信息技术有限公司 | Speech recognition model training method, speech recognition method and related device |
CN111737407A (en) * | 2020-08-25 | 2020-10-02 | 成都数联铭品科技有限公司 | Event unique ID construction method based on event disambiguation |
CN113326380A (en) * | 2021-08-03 | 2021-08-31 | 国能大渡河大数据服务有限公司 | Equipment measurement data processing method, system and terminal based on deep neural network |
CN113761942A (en) * | 2021-09-14 | 2021-12-07 | 合众新能源汽车有限公司 | Semantic analysis method and device based on deep learning model and storage medium |
CN113761942B (en) * | 2021-09-14 | 2023-12-05 | 合众新能源汽车股份有限公司 | Semantic analysis method, device and storage medium based on deep learning model |
CN113609825A (en) * | 2021-10-11 | 2021-11-05 | 北京百炼智能科技有限公司 | Intelligent customer attribute tag identification method and device |
CN114398492A (en) * | 2021-12-24 | 2022-04-26 | 森纵艾数(北京)科技有限公司 | Knowledge graph construction method, terminal and medium in digital field |
CN114398492B (en) * | 2021-12-24 | 2022-08-30 | 森纵艾数(北京)科技有限公司 | Knowledge graph construction method, terminal and medium in digital field |
Also Published As
Publication number | Publication date |
---|---|
CN110020438B (en) | 2020-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110020438A (en) | Enterprise or organization Chinese entity disambiguation method and device based on sequence recognition | |
WO2022141878A1 (en) | End-to-end language model pretraining method and system, and device and storage medium | |
CN110209822A (en) | Sphere of learning data dependence prediction technique based on deep learning, computer | |
CN117236338B (en) | Named entity recognition model of dense entity text and training method thereof | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN114282592A (en) | Deep learning-based industry text matching model method and device | |
CN114398900A (en) | Long text semantic similarity calculation method based on RoBERTA model | |
CN113869054A (en) | Deep learning-based electric power field project feature identification method | |
Xue et al. | A method of chinese tourism named entity recognition based on bblc model | |
Xu et al. | Short text classification of chinese with label information assisting | |
Zhao et al. | Chinese named entity recognition in power domain based on Bi-LSTM-CRF | |
CN109117471A (en) | A kind of calculation method and terminal of the word degree of correlation | |
CN116680407A (en) | Knowledge graph construction method and device | |
CN114548090B (en) | Fast relation extraction method based on convolutional neural network and improved cascade labeling | |
Wang et al. | Predicting the Chinese poetry prosodic based on a developed BERT model | |
CN112613316B (en) | Method and system for generating ancient Chinese labeling model | |
CN114238649A (en) | Common sense concept enhanced language model pre-training method | |
Xu et al. | Causal event extraction using causal event element-oriented neural network | |
Wu | A computational neural network model for college English grammar correction | |
Yao et al. | Heterogeneous Graph Neural Network for Chinese Financial Event Extraction | |
Ding et al. | Graph structure-aware bi-directional graph convolution model for semantic role labeling | |
Chen | Automatic Assessment Method of Oral English Based on Multimodality | |
Liu et al. | Text Analysis of Community Governance Case based on Entity and Relation Extraction | |
CN115114915B (en) | Phrase identification method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||