CN110020438A - Method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition - Google Patents

Method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition

Info

Publication number
CN110020438A
CN110020438A
Authority
CN
China
Prior art keywords
data
word
synonymous
training
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910297022.1A
Other languages
Chinese (zh)
Other versions
CN110020438B (en)
Inventor
顾凌云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai IceKredit Information Technology Co., Ltd.
Shanghai IceKredit Inc
Original Assignee
Shanghai IceKredit Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai IceKredit Information Technology Co., Ltd.
Priority to CN201910297022.1A
Publication of CN110020438A
Application granted
Publication of CN110020438B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition. The method includes: crawling a public news data set and performing data cleaning on it to obtain cleaned data; extracting the entity words in the cleaned data to obtain preliminary normalized data; setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized; determining the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs in it; setting a data annotation strategy, annotating the data to be normalized, and adding manually constructed data to obtain training data; pre-training character vectors and word vectors, and merging the character vectors and word vectors in the vertical direction to obtain new vectors; building a model with an Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model; and predicting samples to be predicted with the best-metric model.

Description

Method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition
Technical field
The present invention relates to the technical field of entity disambiguation, and in particular to a method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition.
Background art
Entity disambiguation is the task of resolving, by some means, the semantic confusion that arises when the same noun carries different meanings. In recent years, with the development of artificial intelligence, the market demand for accurately identifying Chinese synonyms within long texts has become increasingly evident, and for the legal and financial industries this demand is especially urgent. With the development of natural language processing, entity disambiguation methods for Chinese have multiplied; current approaches on the market include methods based on text classification and methods that combine knowledge bases with deep learning. These techniques share a common drawback: they convert the entity disambiguation problem into a text classification problem, which raises the following issues: 1. machine learning models in this setting cannot extract textual context features well; 2. handling the task as text classification requires judging the ambiguity of every entity word and relies on a large and complex knowledge base as support. As a result, the engineering required by such a project becomes complicated, and the approach lacks good applicability in terms of both cost control and performance.
Meanwhile, the rise in recent years of sequence models built on the Encoder-Decoder structure for language modeling has brought new ideas to Chinese entity disambiguation. This model structure treats the text as a sequence: the input is a pre-built short text containing multiple entity words with the same name, and the output is the tag corresponding to each character in the short text. In this way character-level features can be extracted, so the model can be trained on representations that capture the textual context well. At the same time, position embeddings over text tokens, as proposed by Google, serve as positional features of phrases; combined with this embedding scheme and multi-layer attention processing, model structures such as the Transformer also open the possibility of better results for entity disambiguation techniques based on the sequence-recognition idea.
Summary of the invention
The present invention aims to provide a method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition that overcomes, or at least partially solves, the above problems.
In order to achieve the above objectives, the technical solution of the present invention is specifically achieved as follows:
One aspect of the present invention provides a method for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition, comprising: crawling a public news data set and performing data cleaning on it to obtain cleaned data; extracting the entity words in the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of company names (COM) and organization names (ORG); setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized; determining the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized; setting a data annotation strategy, annotating the data to be normalized, and adding manually constructed data for data augmentation to obtain training data; pre-training character vectors and word vectors on the preliminary normalized data, and merging the character vectors and word vectors in the vertical direction to obtain new vectors; preprocessing the training data; building a model with an Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model; and predicting samples with the best-metric model, using a beam search strategy and choosing the highest-probability arrangement as the output sequence to obtain the synonym tag sequence of the sample to be predicted.
Wherein, crawling a public news data set and performing data cleaning on it to obtain cleaned data comprises: crawling public national, economic, and technology news data, removing special characters and meaningless symbols, checking for null values, and removing any record that contains a null value, thereby obtaining the cleaned data. Extracting the entity words in the cleaned data to obtain preliminary normalized data comprises: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and at the same time re-segmenting long sentences so that the number of characters in each sentence stays within a preset limit. Setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized comprises: performing data screening with template rules of the form "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" together with a manually constructed dictionary of high-frequency domain verbs, thereby obtaining the data to be normalized. And/or, determining the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs, comprises: taking the entity word with more characters in a sentence of the data to be normalized as the synonymous standard word; and if the other entity word in the sentence belongs to the same class as the standard word, every character of the other entity word is contained in the standard word, and the other entity word is longer than one character, taking the other entity word as the synonymous abbreviation word and determining that the standard word and abbreviation word in the sentence form a synonym pair.
Wherein, setting the data annotation strategy, annotating the data to be normalized, and adding manually constructed data for data augmentation to obtain training data comprises: labeling the first character of the synonymous standard word in each sentence of the data to be normalized as E1, the other characters of the standard word as I1, the first character of each synonymous abbreviation word as E2, the other characters of the abbreviation word as I2, and all other characters in the sentence not in the synonym pair as O; and adding manually supplemented data that satisfies the semantic template rules to the preliminary normalized data to obtain the training data, wherein the supplemented data is mixed with the data to be normalized by random shuffling. Pre-training character vectors and word vectors on the preliminary normalized data and merging the character vectors and word vectors in the vertical direction to obtain new vectors comprises: training character vectors with a skip-gram Word2vec model; adding the entity words to the dictionary formed after word segmentation and training word vectors; and merging the word vectors with the character vectors in the vertical direction to obtain the new vectors. And/or preprocessing the training data comprises: separating the training data into a tag sequence and a Chinese sequence, filtering stopwords from the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary index.
Wherein, building the model with an Encoder-Decoder structure, training it on the preprocessed training data, and saving the best-metric model comprises: using a model whose structure is Encoder-Decoder; in the encoder, extracting sequence features with convolutional neural networks whose kernel sizes are 3, 4, and 5 respectively, passing each through a bidirectional recurrent neural network for sequence modeling, and adding self-attention to generate the corresponding attention weights as the intermediate state values output by the Encoder side; and at the decoder side, forming the decoder from a 2-layer bidirectional recurrent neural network, feeding the target sequence of the previous time step into the decoder, combining it with the intermediate state layer, and generating the target sequence of the next time step.
Wherein, predicting samples with the best-metric model comprises: using a beam search strategy with the beam size set to 3, choosing the highest-probability arrangement as the output sequence, and obtaining the synonym tag sequence of the sample to be predicted.
Another aspect of the present invention provides a device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition, comprising: a data set construction module, configured to crawl a public news data set and perform data cleaning on it to obtain cleaned data; extract the entity words in the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of company names (COM) and organization names (ORG); set semantic template rules and screen the preliminary normalized data to obtain data to be normalized; and determine the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs; a data annotation module, configured to set the data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation to obtain training data; a vector training module, configured to pre-train character vectors and word vectors on the preliminary normalized data and merge the character vectors and word vectors in the vertical direction to obtain new vectors; a preprocessing module, configured to preprocess the training data; a model training module, configured to build a model with an Encoder-Decoder structure, train it on the preprocessed training data, and save the best-metric model; and a prediction module, configured to predict samples with the best-metric model, using a beam search strategy and choosing the highest-probability arrangement as the output sequence to obtain the synonym tag sequence of the sample to be predicted.
Wherein, the data set construction module crawls the public news data set and cleans it in the following way: it crawls public national, economic, and technology news data, removes special characters and meaningless symbols, checks for null values, and removes any record containing a null value to obtain the cleaned data. The data set construction module extracts the entity words in the cleaned data in the following way: it processes the cleaned data with a pre-trained Chinese named entity recognition model, extracts the company-name and organization-name entity words of each sentence as supplementary training corpus, and re-segments long sentences so that each sentence stays within the preset character limit. The data set construction module sets the semantic template rules and screens the preliminary normalized data in the following way: it screens the data with template rules of the form "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" together with the manually constructed dictionary of high-frequency domain verbs, thereby obtaining the data to be normalized. And/or the data set construction module determines the synonymous standard words and synonymous abbreviation words in the data to be normalized as follows: it takes the entity word with more characters in a sentence as the synonymous standard word; and if the other entity word in the sentence belongs to the same class as the standard word, every character of the other entity word is contained in the standard word, and the other entity word is longer than one character, it takes the other entity word as the synonymous abbreviation word and determines that the two form a synonym pair.
Wherein, the data annotation module sets the data annotation strategy and annotates the data in the following way: it labels the first character of the synonymous standard word in each sentence of the data to be normalized as E1, the other characters of the standard word as I1, the first character of each synonymous abbreviation word as E2, the other characters of the abbreviation word as I2, and all other characters not in the synonym pair as O; it then adds manually supplemented data that satisfies the semantic template rules to the preliminary normalized data to obtain the training data, the supplemented data being mixed with the data to be normalized by random shuffling. The vector training module pre-trains the character and word vectors on the preliminary normalized data and merges them in the following way: it trains character vectors with a skip-gram Word2vec model, adds the entity words to the dictionary formed after word segmentation, trains word vectors, and merges the word vectors with the character vectors in the vertical direction to obtain the new vectors. And/or the preprocessing module preprocesses the training data in the following way: it separates the training data into a tag sequence and a Chinese sequence, filters stopwords from the Chinese sequence, builds a dictionary, and encodes the text sequence according to the dictionary index.
Wherein, the model training module builds and trains the model in the following way: it uses a model whose structure is Encoder-Decoder; in the encoder it extracts sequence features with convolutional neural networks whose kernel sizes are 3, 4, and 5 respectively, passes each through a bidirectional recurrent neural network for sequence modeling, and adds self-attention to generate the corresponding attention weights as the intermediate state values output by the Encoder side; at the decoder side it forms the decoder from a 2-layer bidirectional recurrent neural network, feeds the target sequence of the previous time step into the decoder, combines it with the intermediate state layer, and generates the target sequence of the next time step.
Wherein, the prediction module predicts samples with the best-metric model in the following way: it uses a beam search strategy with the beam size set to 3, chooses the highest-probability arrangement as the output sequence, and obtains the synonym tag sequence of the sample to be predicted.
It can be seen that the method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by the embodiments of the present invention build an effective data set, used as the basis for model training, by constructing simple semantic templates and manually annotating data over crawled news corpora; pre-train character and word vectors from 5 million self-crawled news sentences, with the vector dimension controlled at 500; and build a model structure that can, starting from the language model, extract the contextual information of text sequences well.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a flowchart of the method for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the data set construction process provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data annotation strategy provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the Encoder-side structure of the model provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the Decoder-side structure of the model provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the structure of the device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by an embodiment of the present invention.
Specific embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. On the contrary, these embodiments are provided to facilitate a more thorough understanding of the present disclosure and to fully convey its scope to those skilled in the art.
The purposes of the method and device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by the embodiments of the present invention are:
First, in a credit-reference public opinion system for small and micro enterprises, to discriminate whether synonymous entity words occur when a user searches for an enterprise or organization name with a sentence query; if they do, the system treats all such names as the same target, thereby avoiding the slow retrieval and poor user experience caused by entity ambiguity.
Second, following the idea of converting the entity disambiguation problem into a sequence labeling problem, the present invention proposes a new Chinese entity word disambiguation method that uses a seq2seq model structure and dispenses with complex knowledge bases.
Fig. 1 shows the flowchart of the method for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by an embodiment of the present invention. Referring to Fig. 1, the method comprises:
S1: crawl a public news data set and perform data cleaning on it to obtain cleaned data.
Specifically, public news data is crawled first, and data cleaning is performed on the crawled data set (the detailed process is shown in Fig. 2). As an optional embodiment of the present invention, crawling the public news data set and cleaning it comprises: crawling public national, economic, and technology news data, removing special characters and meaningless symbols, checking for null values, and removing any record that contains a null value, thereby obtaining the cleaned data.
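The cleaning step can be pictured in a few lines of Python. This is a minimal sketch under the assumption that the crawled articles arrive as raw strings; the character whitelist in the regex is illustrative, not taken from the patent.

```python
import re

def clean_corpus(raw_docs):
    """Drop null/empty records and strip special or meaningless symbols."""
    cleaned = []
    for doc in raw_docs:
        if doc is None or not doc.strip():   # null-value check: drop the record
            continue
        # keep Chinese characters, ASCII letters/digits and basic punctuation
        text = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9，。、；：？！\s]", "", doc)
        cleaned.append(text.strip())
    return cleaned
```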
S2: extract the entity words in the cleaned data to obtain preliminary normalized data; wherein the entity words include at least one of company names (COM) and organization names (ORG).
As an optional embodiment of the present invention, extracting the entity words in the cleaned data to obtain preliminary normalized data comprises: processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and at the same time re-segmenting long sentences so that the number of characters in each sentence stays within a preset limit.
In specific implementation, the crawled public news data set is processed by a pre-built Chinese named entity recognition model (NER system); the entity words carrying company names (COM) or organization names (ORG) in the news texts are extracted as the supplementary corpus for vector training, and long sentences are re-segmented so that each sentence is kept within 500 characters.
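As an illustration of this step, the following sketch assumes the pre-trained NER model is available behind a `ner(text) -> [(word, tag), ...]` interface; the patent does not specify the NER implementation, so that interface and the sentence-splitting regex are assumptions.

```python
import re

def extract_entities_and_chunk(docs, ner, max_len=500):
    """Collect COM/ORG entity words and re-segment articles into
    sentences of at most max_len characters."""
    entity_corpus, chunks = [], []
    for doc in docs:
        entity_corpus += [w for w, tag in ner(doc) if tag in ("COM", "ORG")]
        for sent in re.split(r"[。！？]", doc):
            sent = sent.strip()
            if sent:
                chunks += [sent[i:i + max_len] for i in range(0, len(sent), max_len)]
    return entity_corpus, chunks
```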
S3: set semantic template rules and screen the preliminary normalized data to obtain data to be normalized.
Specifically, as an optional embodiment of the present invention, setting the semantic template rules and screening the preliminary normalized data comprises: performing data screening with template rules of the form "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" together with a manually constructed dictionary of high-frequency domain verbs, thereby obtaining the data to be normalized.
In specific implementation, the sentences containing these two classes of entity words are extracted from each input news text, and are then screened with simple semantic template rules such as "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM>"; simple sentences that satisfy the rules, such as "China Construction Bank is abbreviated as Construction Bank and is located in Shanghai", are extracted as the data set to be normalized. The verbs in the template rules for the company-name and organization-name domain are taken from a manually chosen dictionary of high-frequency verbs. After the above processing, the resulting data set has the property that each sentence contains only ORG or only COM entity words, and each sentence has only 2 to 4 entity words.
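A hedged sketch of the template filter: it assumes each candidate sentence arrives as a token/tag list from the NER step, and the entries of the high-frequency verb dictionary shown here are hypothetical placeholders.

```python
HIGH_FREQ_VERBS = {"简称", "位于", "收购", "投资"}  # hypothetical dictionary entries

def matches_template(tokens):
    """tokens: [(word, tag)] with tag in {'ORG', 'COM', 'V', 'O'}.
    Accepts <ORG|COM> + high-frequency verb + <ORG|COM> patterns."""
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        if t1 in ("ORG", "COM") and t3 in ("ORG", "COM") and w2 in HIGH_FREQ_VERBS:
            return True
    return False
```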
S4: determine the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized.
As an optional embodiment of the present invention, determining the synonymous standard words and synonymous abbreviation words comprises: taking the entity word with more characters in a sentence of the data to be normalized as the synonymous standard word; and if the other entity word in the sentence belongs to the same class as the standard word, every character of the other entity word is contained in the standard word, and the other entity word is longer than one character, taking the other entity word as the synonymous abbreviation word and determining that the standard word and abbreviation word in the sentence form a synonym pair.
In specific implementation, after the previous step the entity words in each sentence are extracted, and the entity word with more characters is chosen as the synonymous standard word of that sentence. If another entity word belongs to the same class as the standard word, each of its characters is contained in the standard word, and its length is greater than 1, then that word is called a synonymous abbreviation word, and the two entity words form a synonym pair in the sentence. For example, in the short sentence "China Construction Bank, abbreviated as Construction Bank, is one of the larger state-owned banks in China", the entity words "China Construction Bank" and "Construction Bank" belong to the same organization-name class, and every character of "Construction Bank" is contained in "China Construction Bank", so the two words form a synonym pair, with the word having the most characters as the synonymous standard word. If a sentence contains both a synonymous standard word and a synonymous abbreviation word, the two express the same meaning semantically, and the semantic information of the abbreviation can be named by the standard word, which eliminates the ambiguity of the abbreviation itself.
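The synonym-pair rule reduces to a short containment check. In this sketch the same-class condition (both ORG or both COM) is assumed to have been verified upstream by the NER tags.

```python
def synonym_pair(entity_a, entity_b):
    """Return (standard word, abbreviation word) if the two entity words
    form a synonym pair under the S4 rule, else None."""
    standard, other = sorted([entity_a, entity_b], key=len, reverse=True)
    if len(other) > 1 and all(ch in standard for ch in other):
        return standard, other
    return None

print(synonym_pair("中国建设银行", "建设银行"))  # ('中国建设银行', '建设银行')
```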
S5: set the data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation to obtain training data.
As an optional embodiment of the present invention, setting the data annotation strategy, annotating the data to be normalized, and adding manually constructed data for augmentation comprises: labeling the first character of the synonymous standard word in each sentence of the data to be normalized as E1, the other characters of the standard word as I1, the first character of each synonymous abbreviation word as E2, the other characters of the abbreviation word as I2, and all other characters in the sentence not in the synonym pair as O; and adding manually supplemented data that satisfies the semantic template rules to the preliminary normalized data to obtain the training data, wherein the supplemented data is mixed with the data to be normalized by random shuffling.
In specific implementation, the annotation strategy used in the embodiment of the present invention (shown in Fig. 3) is as follows. Of the two words in a sentence that form a synonym pair, the word with the most characters is determined to be the synonymous standard word; its first character is labeled E1 and its other characters are labeled I1. The shorter synonym is, correspondingly, the abbreviation word; its first character is labeled E2 and its other characters are labeled I2. Any other entity word whose characters are all contained in the synonymous standard word is labeled the same way, E2 for its first character and I2 for the rest, and all characters unrelated to the entity words are labeled O. For example, in the sentence "China Construction Bank, abbreviated as CCB or Construction Bank, is one of China's large banks", "China Construction Bank" is the synonymous standard word and is labeled "E1 I1 I1 I1 I1 I1"; "Construction Bank", as a corresponding abbreviation word, is labeled E2 for its first character and I2 for each remaining character, and the other abbreviation is labeled in the same way. Finally, the other characters in the sentence are labeled "O". Manually supplemented data that satisfies the above semantic template rules is added to the training set to achieve data augmentation; this supplemented data amounts to about 20,000 texts and is shuffled together with the previously crawled and processed texts by random permutation.
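A character-level tagging sketch of this strategy follows. Looking spans up with `str.find` is a simplification, and abbreviation matches that overlap the standard-word span are skipped so the abbreviation is only tagged where it stands alone.

```python
def tag_sentence(sentence, standard, abbrevs):
    """Produce E1/I1/E2/I2/O tags, one per character of the sentence."""
    tags = ["O"] * len(sentence)
    s = sentence.find(standard)
    if s >= 0:
        tags[s] = "E1"
        for k in range(s + 1, s + len(standard)):
            tags[k] = "I1"
    for ab in abbrevs:
        start = sentence.find(ab)
        while start >= 0:
            # skip matches that fall inside the standard-word span
            if s < 0 or start >= s + len(standard) or start + len(ab) <= s:
                tags[start] = "E2"
                for k in range(start + 1, start + len(ab)):
                    tags[k] = "I2"
            start = sentence.find(ab, start + 1)
    return tags
```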
S6: pre-train character vectors and word vectors on the preliminary normalized data, and merge the character vectors and word vectors in the vertical direction to obtain new vectors.
As an optional embodiment of the present invention, pre-training the character vectors and word vectors on the preliminary normalized data and merging them in the vertical direction comprises: training character vectors with a skip-gram Word2vec model; adding the entity words to the dictionary formed after word segmentation and training word vectors; and merging the word vectors with the character vectors in the vertical direction to obtain the new vectors.
In specific implementation, the character vectors used in the embodiment of the present invention are pre-trained with the window parameter set to 5, on the sentences that satisfy the constructed semantic template rules among the 5 million crawled sentences. The embodiment uses a skip-gram Word2vec model, and the resulting character vector dimension is 500. Meanwhile, the embodiment also segments the text into words, adds the company-name (COM) and organization-name (ORG) entity words obtained in the preceding steps to the dictionary formed after segmentation, and trains word vectors, whose dimension is also 500. In order to better extract the background information contained in the text, the obtained word vectors are merged with the character vectors in the vertical direction to obtain new vectors, which are then used as pre-trained vectors for the embedding of the training model.
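A sketch of this pre-training step using gensim's Word2Vec, where `sg=1` selects the skip-gram structure named above. The toy corpus is illustrative, and the "vertical-direction merge" is read here as concatenation of the 500-dimensional character and word vectors, which is one possible interpretation of the patent's wording.

```python
import numpy as np
from gensim.models import Word2Vec

sentences = ["中国建设银行简称建设银行"]            # cleaned corpus (illustrative)
tokenized = [["中国建设银行", "简称", "建设银行"]]  # entity words added to the dictionary

char_model = Word2Vec([list(s) for s in sentences],
                      vector_size=500, window=5, sg=1, min_count=1)
word_model = Word2Vec(tokenized, vector_size=500, window=5, sg=1, min_count=1)

def merged_vector(char, word):
    """Merge a 500-dim character vector with a 500-dim word vector."""
    return np.concatenate([char_model.wv[char], word_model.wv[word]])
```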
S7: preprocess the training data.
As an optional embodiment of the present invention, preprocessing the training data comprises: separating the training data into a tag sequence and a Chinese sequence, filtering stopwords from the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary index.
In specific implementation, text data cleaning and preprocessing are performed on the training data. In the obtained text data set, special characters are removed with regular expressions, and a pre-built stopword list is then introduced to remove auxiliary particles such as "的" and "得" that are meaningless for training. The annotated sentences are then processed, via Chinese word segmentation, into the training input sequence, characterized as X1, X2, ..., Xn. At the same time, the annotation field of the corresponding sentence sequence is processed into the target text output sequence at time T0, characterized as Y1, Y2, ..., Yn. An "<EOS>" identifier is then appended to the end of the target sequence to indicate the end position of sequence prediction; the new target sequence obtained at this point is called the text output sequence at time T1, characterized as Y1, Y2, ..., Yn, <EOS>.
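A minimal preprocessing sketch: split each annotated example into character and tag sequences, filter stopwords, build the index dictionary, and derive the T0 and T1 targets. The stopword list contents and helper names are assumptions.

```python
STOPWORDS = {"的", "得"}                       # pre-built stopword list (assumed)

def preprocess(examples):
    """examples: [(chars, tags)] pairs produced by the labeling step."""
    vocab = {"<PAD>": 0, "<GO>": 1, "<EOS>": 2}
    encoded = []
    for chars, tags in examples:
        kept = [(c, t) for c, t in zip(chars, tags) if c not in STOPWORDS]
        x = [vocab.setdefault(c, len(vocab)) for c, _ in kept]  # dictionary-index encoding
        y_t0 = [t for _, t in kept]            # target sequence at time T0
        y_t1 = y_t0 + ["<EOS>"]                # shifted target at time T1
        encoded.append((x, y_t0, y_t1))
    return vocab, encoded
```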
S8: build a model with an Encoder-Decoder structure, train it on the preprocessed training data, and save the best-metric model.
As an optional embodiment of the present invention, building the model with an Encoder-Decoder structure and training it on the preprocessed training data comprises: using a model whose structure is Encoder-Decoder; in the encoder, extracting sequence features with convolutional neural networks whose kernel sizes are 3, 4, and 5 respectively, passing each through a bidirectional recurrent neural network for sequence modeling, and adding self-attention to generate the corresponding attention weights as the intermediate state values output by the Encoder side; and at the decoder side, forming the decoder from a 2-layer bidirectional recurrent neural network, feeding the target sequence of the previous time step into the decoder, combining it with the intermediate state layer, and generating the target sequence of the next time step.
In specific implementation, the model structure used in the embodiment of the present invention is the Encoder-Decoder structure. A training-data text sequence is input and processed by the encoder to generate the intermediate hidden layer C. The target-data text sequence at time T0 is then input and processed by an embedding layer to generate the matrix (Target | T=T0), which is combined with C to obtain the matrix [C, Target | T=T0]. The decoder takes this matrix as input in order to predict the target data sequence that should be output at the next time period T1. Convolutional neural networks (CNN) extract local text information well, but because of the limited local receptive field they are not suited to extracting long-range entity relations in text; connecting convolutional networks with different kernel sizes in parallel allows text sequence information to be extracted further. At the same time, BiLSTM-style models based on the recurrent neural network (RNN) structure can handle cases where the distance between two entity words in a sentence is too long, or where a third entity word is interposed between the two. Therefore, in the Encoder structure of the embodiment (shown in Fig. 4), three parallel branches with different convolution kernel sizes perform the convolution operations on the text, and each CNN branch is followed by one BiLSTM layer in order to alleviate the problems brought by long-distance learning between entity words. Finally, a self-attention mechanism is applied to obtain the hidden state C of each character in the text sequence. In the embedding stage, the character vectors are merged with the word-segmentation vectors of each sentence before the embedding operation, which improves the learning of the model.
In the Decoder structure of the model (shown in Fig. 5), the hidden-layer tensor obtained from the Encoder and the target tensor at time T0 are added and input into the model for training; the data obtained after softmax is the data at time T1. Note that the target data at times T0 and T1 are obtained in the data preprocessing stage.
The embodiment of the present invention takes the annotation characters obtained for each sentence as the target data sequence, and the corresponding sentence as the training data sequence. An "<EOS>" character is appended to each line of target data to indicate the end position of sequence prediction; after alignment with the original target data, this appended version constitutes the data at time T1, while the original target data corresponds to time T0. Finally, the Encoder and Decoder are connected and trained together, which constitutes the model training stage provided by the embodiment of the present invention.
The mathematical theory of the model and the detailed operating steps are described below; model training can be carried out through the following steps:
S81: extract sequence features with convolutional neural networks whose kernel sizes are 3, 4, and 5:
The character vector dimension is set to d in the embodiment of the present invention, so the tensor obtained through the embedding layer has size $x_i = (\text{batch}, d)$ and can be characterized as $x_{1:n} = x_1 \oplus x_2 \oplus \cdots \oplus x_n$, where $\oplus$ denotes row-wise concatenation of matrices and n is the number of input examples. Sequence information is extracted by setting different word window sizes h, and the feature map obtained by each word window at each layer is computed as:

$c_i = f(W_1 x_{i:i+h-1} + b)$

In the above formula, the function f is the activation function of the convolutional network, i is the row index of the tensor $x_i$, b is the bias term of the network, and $W_1$ is the parameter to be trained, initialized to 0. After the convolution operation, the sequence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ produces the corresponding feature set D, whose structure can be expressed as:

$D = [c_1, c_2, \ldots, c_{n-h+1}], \quad D \in \mathbb{R}^{n-h+1}$

On the obtained set D, a max-pooling operation selects $D_j = \max\{D\}$ as the feature vector obtained when the word window size is h.
S82: add a Dropout layer, with the neuron dropout ratio set to 50%:
To prevent overfitting during training, a Dropout layer is placed after the max-pooling layer; it randomly selects 50% of the neurons, whose parameters stop being updated while their weights are retained, and the other 50% continue to have their parameters updated by gradient descent.
S83: add a BiLSTM layer, with the number of neurons set to the character vector dimension d:
In order to extract the contextual properties of the tensors between neural network nodes well and to reduce the influence of long-distance dependencies, the embodiment of the present invention adds a bidirectional LSTM layer so that the tensor keeps good sequence features after the convolution processing. The feature matrix $D_j$ screened by the Dropout layer is input to this sub-network, whose update formulas are as follows:

Forward update: $h_{t1} = f_1(w_{21} D_j + v_{21} h_{t-1} + b_1)$

Backward update: $h_{t2} = f_1(w_{22} D_j + v_{22} h_{t+1} + b_2)$

Hidden layer output: $G = g(U[h_{t1}; h_{t2}] + c)$

where $h_{t1}$ is the hidden state generated by the forward LSTM computation and $h_{t2}$ is the hidden state generated by the backward LSTM computation; w and v are training parameters, and b is the corresponding bias term, initialized to 0. The tensor finally obtained, G, is the resulting feature sequence. A total of 3 identical branches are used in parallel for text information extraction in the embodiment of the present invention, so the obtained feature sequences are denoted $G_1$, $G_2$, $G_3$.
S84: introduce a Concatenate layer to merge the tensors obtained by the three branches:
After the preceding processing, the sequence tensors obtained are $G_1$, $G_2$, and $G_3$. Splicing along the last dimension of the sequence tensors yields the new sequence tensor $G_n$, each row of which represents the vector of one character.
S85: introduce the self-attention mechanism to generate the corresponding hidden state layer C:
Here the sequence tensor X processed by the embedding layer is regarded as the source sequence, characterized as $(X_1, X_2, \ldots, X_n)$, with size (batch, d). The sequence $G_n$ is regarded as being composed of $\langle key, value \rangle$ pairs, and the sequence tensor X obtained by the embedding operation above serves as the Query for computing the attention weight values. Since $G_n$ is derived from $X_i$, the attention mechanism is placed inside the Encoder structure for training. Query, Key, and Value are first defined as follows:

$Query = W_Q X$

$Key = W_K G_n$

$Value = W_V G_n$

where $W_Q$, $W_K$, $W_V$ are parameters to be trained. The similarity between the Query and the i-th keyword $Key_i$ in the sequence is then obtained by computing the cosine similarity:

$Sim_i = (Query \times Key_i) / (\lVert Query \rVert \times \lVert Key_i \rVert)$

The above results are normalized with the softmax method to obtain the attention weight $a_i$ of each key, where i indexes the key values and $Key_i$ denotes the vector built from each row of $G_n$:

$a_i = \exp(Sim_i) / \sum_{l=1}^{L} \exp(Sim_l)$

Finally, the hidden state parameter $C_j$ in the Encoder is computed as follows; the state parameter $C_j$ generated for each keyword $Key_i$ may differ, where $a_i$ is the corresponding attention weight and L is the sequence length:

$C_j = \sum_{i=1}^{L} a_i \cdot Value_i$
S86: construct two BiLSTM layers connected in series as the Decoder side, completing the model training stage:
After obtaining the hidden state layer $C_j$ generated by the Encoder side, the target sequence data $(Y_1, Y_2, \ldots, Y_n)$ at time T0 obtained in data preprocessing is passed through an embedding operation to obtain the matrix Y. $C_j$ and Y serve as the Decoder-side inputs, and finally, after softmax is applied, the target sequence data at time T1, $(Y_1, Y_2, \ldots, Y_n, \langle EOS \rangle)$, is output. The formulas are as follows:

$Y|_{T=T1} = f_1(C_j, Y|_{T=T0})$

$T1 = T0 + 1$

where $f_1$ denotes the nonlinear transformation function at the decoder side, and T1 is the time step following T0. The cross-entropy loss function is selected as the loss for the gradient iteration process during training. Finally, a suitable number of epochs is chosen in combination with the loss curve, and the model with the best metrics is saved after hyperparameter tuning.
S9: predict samples with the best-metric model, using the beam search strategy and choosing the highest-probability arrangement as the output sequence, to obtain the synonym tag sequence of the sample to be predicted.
As an optional embodiment of the present invention, predicting samples with the best-metric model comprises: using the beam search strategy with the beam size set to 3, choosing the highest-probability arrangement as the output sequence, and obtaining the synonym tag sequence of the sample to be predicted.
In specific implementation, in the prediction stage of the model the data is preprocessed, and a "<GO>" character is spliced to the front of every data line to indicate the start position of the prediction sequence. The prediction stage shares weights with the training stage, so the resulting text sequence can be input, after embedding, as tensors for prediction. In the sample to be predicted, the characters carrying the "E1", "I1", "E2", "I2" tags in the corresponding sentence constitute the Chinese synonyms.
In the model prediction stage, target data is absent. Therefore the sequence data set to be predicted is processed into encoded form, with the <EOS> end-of-sequence symbol appended at the end of each sequence. Beam search size = 3 is then set, the sequence is input into the model, and the output of the previous time step serves as the input of the next time step; each time step outputs a probability matrix over the different text sequences, and prediction stops once the algorithm traverses the <EOS> marker in the sequence. Finally, the text sequence with the maximum probability among all outputs is selected as the predicted sequence.
Suppose a corpus contains only the two words A and B; then the process runs as follows: the first time period outputs the probabilities P(A) and P(B) of the two words, and then takes [A, B]^T as the input of the next time step, at which point the model outputs the sequence probability matrix P(AB|A), P(AA|A), P(AB|B), ..., P(BB|B). The following time periods proceed in the same way, except that whenever the sequences reach 3 words, only the highest-probability sequences among all candidate arrangements are retained. Going through all of the above strategies eventually produces the synonym tag sequence of the entity words. After this prediction operation, the synonym pairs of enterprise or organization entity words in a complete sentence are obtained, which avoids the semantic understanding problems brought by differing synonymous abbreviations when users search, and effectively reduces the ambiguity of name-class entity words. While the user uses the system, synonyms are also dynamically collected in the background; if the user inputs the abbreviation of a relevant enterprise or organization, the system can also quickly locate the target the user is searching for, which will greatly improve retrieval efficiency.
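The toy A/B walk-through above is exactly width-3 beam search. A generic sketch follows, where `step_probs` stands in for one decoder step of the trained model and the token inventory is an assumption.

```python
import math

def beam_search(step_probs, vocab, eos="<EOS>", width=3, max_len=50):
    """step_probs(seq) -> probabilities over vocab for the next token."""
    beams = [(["<GO>"], 0.0)]                  # (sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished beams carry over
                candidates.append((seq, score))
                continue
            probs = step_probs(seq)            # one model step
            for tok, p in zip(vocab, probs):
                candidates.append((seq + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]                         # highest-probability sequence
```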
It can be seen that the method for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by the embodiments of the present invention makes it possible to apply Chinese entity disambiguation in the enterprise public opinion analysis business. The method provided by the embodiments has the following advantages:
1. By constructing simple semantic templates and manually annotating data, news corpora are crawled to build an effective data set, which serves as the basis for model training.
2. Pre-trained character and word vectors are built from 5 million self-crawled news sentences, with the vector dimension controlled at 500.
3. A model structure is built and proposed that can, starting from the language model, extract the contextual information of text sequences well.
Given that Chinese entity disambiguation suffers from a lack of data sets and that existing methods cannot quantify the features that require case-by-case analysis, this method first proposes a scheme for constructing an annotated data set, realizing the conversion of chaotic raw text data into training data for supervised learning. Second, the method proposes a new model structure for text sequence processing, which has the following advantages over conventional methods:
1. It abandons the conventional approach of converting Chinese entity disambiguation into a classification task; such methods mostly perform rule matching by building large knowledge bases in order to identify synonyms, which is too costly, complicated, and inconvenient.
2. Compared with traditional statistical machine learning methods such as Hidden Markov models, whose text vectors are generated from word frequencies, this method extracts text sequence features better and improves the adaptability of the model to the Chinese entity disambiguation scenario.
3. Compared with conventional methods, this method copes better with long-distance learning in text sequences.
Fig. 6 shows the schematic structure of the device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by an embodiment of the present invention. The device is applied to the above method for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition; only a brief description of the structure of the device is given below, and for other matters please refer to the related description of the above method, which will not be repeated here. Referring to Fig. 6, the device provided by the embodiment of the present invention comprises:
a data set construction module 601, configured to crawl a public news data set and perform data cleaning on it to obtain cleaned data; extract the entity words in the cleaned data to obtain preliminary normalized data, wherein the entity words include at least one of company names (COM) and organization names (ORG); set semantic template rules and screen the preliminary normalized data to obtain data to be normalized; and determine the synonymous standard words and synonymous abbreviation words in the data to be normalized, thereby identifying the synonym pairs in it;
a data annotation module 602, configured to set the data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation to obtain training data;
a vector training module 603, configured to pre-train character vectors and word vectors on the preliminary normalized data, and merge the character vectors and word vectors in the vertical direction to obtain new vectors;
a preprocessing module 604, configured to preprocess the training data;
a model training module 605, configured to build a model with an Encoder-Decoder structure, train it on the preprocessed training data, and save the best-metric model;
a prediction module 606, configured to predict samples with the best-metric model, using the beam search strategy and choosing the highest-probability arrangement as the output sequence to obtain the synonym tag sequence of the sample to be predicted.
It can be seen that the device for disambiguating Chinese named entities of enterprises or organizations based on sequence recognition provided by the embodiments of the present invention makes it possible to apply Chinese entity disambiguation in the enterprise public opinion analysis business.
As an optional embodiment of the present invention, the data set construction module 601 crawls the public news data set and cleans it in the following way: the data set construction module 601 is specifically configured to crawl public national, economic, and technology news data, remove special characters and meaningless symbols, check for null values, and remove any record containing a null value to obtain the cleaned data.
As an optional embodiment of the present invention, the data set construction module 601 extracts the entity words in the cleaned data in the following way: the data set construction module 601 is specifically configured to process the cleaned data with a pre-trained Chinese named entity recognition model, extract the company-name and organization-name entity words of each sentence as supplementary training corpus, and re-segment long sentences so that each sentence stays within the preset character limit.
As an optional embodiment of the present invention, the data set construction module 601 sets the semantic template rules and screens the preliminary normalized data in the following way: the data set construction module 601 is specifically configured to screen the data with template rules of the form "<ORG> + verb + <ORG> / <COM> + verb + <ORG> / <ORG> + verb + <COM> / <COM> + verb + <COM>" together with the manually constructed dictionary of high-frequency domain verbs, thereby obtaining the data to be normalized.
As an optional embodiment of the present invention, the data set construction module 601 determines the synonymous standard words and synonymous abbreviation words in the data to be normalized as follows: the data set construction module 601 is specifically configured to take the entity word with more characters in a sentence of the data to be normalized as the synonymous standard word; and if the other entity word in the sentence belongs to the same class as the standard word, every character of the other entity word is contained in the standard word, and the other entity word is longer than one character, to take the other entity word as the synonymous abbreviation word and determine that the standard word and abbreviation word in the sentence form a synonym pair.
As an optional embodiment of the present invention, the data annotation module 602 sets the data annotation strategy and annotates the data in the following way: the data annotation module 602 is specifically configured to label the first character of the synonymous standard word in each sentence of the data to be normalized as E1, the other characters of the standard word as I1, the first character of each synonymous abbreviation word as E2, the other characters of the abbreviation word as I2, and all other characters in the sentence not in the synonym pair as O; and to add manually supplemented data satisfying the semantic template rules to the preliminary normalized data to obtain the training data, wherein the supplemented data is mixed with the data to be normalized by random shuffling.
As an optional embodiment of the present invention, the vector training module 603 pre-trains the character and word vectors on the preliminary normalized data and merges them in the following way: the vector training module 603 is specifically configured to train character vectors with a skip-gram Word2vec model, add the entity words to the dictionary formed after word segmentation, train word vectors, and merge the word vectors with the character vectors in the vertical direction to obtain the new vectors.
In an optional embodiment of the present invention, the preprocessing module 604 preprocesses the training data in the following way: the preprocessing module 604 is specifically configured to separate the training data into an annotation sequence and a Chinese sequence, perform stopword filtering on the Chinese sequence, build a dictionary, and encode the text sequence according to the dictionary indices.
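A compact sketch of this preprocessing, assuming each training item is a pair of parallel character and tag sequences and that a stopword list is available:

```python
def preprocess(training_pairs, stopwords):
    """Separate annotation and Chinese sequences, filter stopword characters
    from the Chinese side (dropping the aligned tags), build a dictionary,
    and encode the text as dictionary indices."""
    texts, tag_seqs = [], []
    for chars, tags in training_pairs:
        kept = [(c, t) for c, t in zip(chars, tags) if c not in stopwords]
        texts.append([c for c, _ in kept])
        tag_seqs.append([t for _, t in kept])
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for seq in texts:
        for ch in seq:
            vocab.setdefault(ch, len(vocab))
    encoded = [[vocab.get(ch, vocab["<UNK>"]) for ch in seq] for seq in texts]
    return encoded, tag_seqs, vocab
```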
In an optional embodiment of the present invention, the model training module 605 builds a model with the Encoder-Decoder structure in the following way, trains it on the preprocessed training data, and saves the model with the best evaluation metrics (the optimal model): the model training module 605 is specifically configured to use a model of the Encoder-Decoder structure; in the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 extract sequence features, each feature stream is serialized through a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, yielding the intermediate state values output by the encoder; at the decoder end, a 2-layer bidirectional recurrent neural network forms the decoder, into which the target sequence of the previous moment is fed at each step to interact with the intermediate state layer and generate the target sequence of the next time step.
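A PyTorch sketch consistent with this description; the GRU cell type, the hidden sizes, and the scalar attention scoring are editorial assumptions where the text only says "recurrent neural network" and "self attention":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb=200, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=0)
        # Three CNN branches with kernel sizes 3, 4, 5 extract n-gram features.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb, hid, k, padding=k // 2) for k in (3, 4, 5))
        # One bidirectional GRU per branch re-serializes the features
        # (the GRU cell type is an assumption; the source says only "RNN").
        self.rnns = nn.ModuleList(
            nn.GRU(hid, hid, bidirectional=True, batch_first=True)
            for _ in range(3))
        self.attn = nn.Linear(3 * 2 * hid, 1)    # scalar self-attention scores

    def forward(self, x):                         # x: (batch, seq)
        e = self.emb(x).transpose(1, 2)           # (batch, emb, seq)
        feats = []
        for conv, rnn in zip(self.convs, self.rnns):
            h = F.relu(conv(e))[:, :, :x.size(1)].transpose(1, 2)
            out, _ = rnn(h)
            feats.append(out)
        h = torch.cat(feats, dim=-1)              # (batch, seq, 6*hid)
        w = torch.softmax(self.attn(h), dim=1)    # attention weights per position
        return w * h                              # weighted intermediate states

class Decoder(nn.Module):
    def __init__(self, n_tags, hid=128):
        super().__init__()
        self.emb = nn.Embedding(n_tags, hid)
        # Two stacked bidirectional GRU layers, as in the source description.
        self.rnn = nn.GRU(hid + 6 * hid, hid, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid, n_tags)

    def forward(self, prev_tags, enc_states):
        # Previous-step target tokens interact with the intermediate states.
        h = torch.cat([self.emb(prev_tags), enc_states], dim=-1)
        out, _ = self.rnn(h)
        return self.out(out)                      # logits for next-step targets
```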
In an optional embodiment of the present invention, the prediction module 606 predicts the sample to be predicted with the optimal model in the following way: the prediction module 606 is specifically configured to use a beam search strategy with a beam size of 3 and select the sequence with the highest probability as the output sequence, obtaining the synonym word sequence of the sample to be predicted.
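A generic beam-width-3 search sketch; `step_logprobs` is a hypothetical callback standing in for the trained decoder, and the BOS/EOS token ids are assumptions:

```python
def beam_search(step_logprobs, beam_size=3, max_len=50, bos=0, eos=1):
    """Keep the beam_size most probable partial sequences at each step and
    return the arrangement with the highest probability."""
    beams = [([bos], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                 # finished hypotheses survive as-is
                candidates.append((seq, score))
                continue
            for token, logp in step_logprobs(seq):
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0][0]
```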
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-volatile memory in computer-readable media, random access memory (RAM), and/or other forms, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
The above are only embodiments of the present application and are not intended to limit the present application. Various modifications and variations of the present application will occur to those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. An enterprise or organization Chinese entity disambiguation method based on sequence recognition, characterized by comprising:
crawling a public news data set and performing data cleaning on the news data set to obtain cleaned data;
extracting entity words from the cleaned data to obtain preliminary normalized data, wherein the entity words comprise at least one of: a company name COM and an organization name ORG;
setting semantic template rules and screening the preliminary normalized data to obtain data to be normalized;
determining the synonym standard words and synonym secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized;
setting a data annotation strategy, annotating the data to be normalized, and adding manually constructed data for data augmentation to obtain training data;
pre-training character vectors and word vectors with the preliminary normalized data, and concatenating the character vectors and the word vectors in the vertical direction to obtain new vectors;
preprocessing the training data;
building a model with an Encoder-Decoder structure, training it on the preprocessed training data, and saving the model with the best evaluation metrics (the optimal model);
predicting a sample to be predicted with the optimal model: using a beam search strategy, selecting the sequence with the highest probability as the output sequence to obtain the synonym word sequence of the sample to be predicted.
2. The method according to claim 1, wherein crawling the public news data set and performing data cleaning on the news data set to obtain the cleaned data comprises:
crawling publicly available national, economic, and technology news data, removing special characters and meaningless symbols, and checking for null values, an item being removed if it contains a null value, to obtain the cleaned data;
extracting the entity words from the cleaned data to obtain the preliminary normalized data comprises:
processing the cleaned data with a pre-trained Chinese named entity recognition model, extracting the company-name and organization-name entity words in each sentence as supplementary training corpus, and splitting or merging long and short sentences so that the character count of each sentence stays within a preset limit;
setting the semantic template rules and screening the preliminary normalized data to obtain the data to be normalized comprises:
performing data screening by setting the pattern rules "&lt;ORG&gt;+verb+&lt;ORG&gt; / &lt;COM&gt;+verb+&lt;ORG&gt; / &lt;ORG&gt;+verb+&lt;COM&gt; / &lt;COM&gt;+verb+&lt;COM&gt;" together with a manually constructed dictionary of high-frequency verbs in the field, to obtain the data to be normalized;
and/or
determining the synonym standard words and synonym secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized, comprises:
taking the entity word with the larger character count in a sentence of the data to be normalized as the synonym standard word; if the other entity word in the sentence belongs to the same class as the synonym standard word, every character of the other entity word is contained in the synonym standard word, and the character count of the other entity word is greater than 1, taking the other entity word as the synonym secondary word, the synonym standard word and the synonym secondary word in the sentence being determined to form a synonym pair.
3. The method according to claim 1, wherein
setting the data annotation strategy, annotating the data to be normalized, and adding the manually constructed data for data augmentation to obtain the training data comprises:
marking the first character of the synonym standard word in each sentence of the data to be normalized as E1 and the other characters of the synonym standard word as I1, marking the first character of the synonym secondary word as E2 and the other characters of the synonym secondary word as I2, and marking every other character in the sentence that does not belong to a synonym pair as O;
adding the manually supplemented data that conform to the semantic template rules to the preliminary normalized data to obtain the training data, wherein the manually supplemented data are mixed with the data to be normalized by random shuffling;
pre-training the character vectors and the word vectors with the preliminary normalized data and concatenating the character vectors and the word vectors in the vertical direction to obtain the new vectors comprises:
training the character vectors with a Word2vec model of the skip-gram structure, adding the entity words to the dictionary formed after word segmentation and training the word vectors, and concatenating the word vectors with the character vectors in the vertical direction to obtain the new vectors;
and/or
preprocessing the training data comprises:
separating the training data into an annotation sequence and a Chinese sequence, performing stopword filtering on the Chinese sequence, building a dictionary, and encoding the text sequence according to the dictionary indices.
4. The method according to claim 1, wherein building the model with the Encoder-Decoder structure, training it on the preprocessed training data, and saving the optimal model comprises:
using a model of the Encoder-Decoder structure, wherein in the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 extract sequence features, each feature stream is serialized through a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, yielding the intermediate state values output by the encoder; at the decoder end, a 2-layer bidirectional recurrent neural network forms the decoder, into which the target sequence of the previous moment is fed at each step to interact with the intermediate state layer and generate the target sequence of the next time step.
5. The method according to claim 4, wherein predicting the sample to be predicted with the optimal model comprises:
using a beam search strategy with a beam size of 3, and selecting the sequence with the highest probability as the output sequence to obtain the synonym word sequence of the sample to be predicted.
6. An enterprise or organization Chinese entity disambiguation device based on sequence recognition, characterized by comprising:
a data set construction module, configured to crawl a public news data set and perform data cleaning on the news data set to obtain cleaned data; extract entity words from the cleaned data to obtain preliminary normalized data, wherein the entity words comprise at least one of: a company name COM and an organization name ORG; set semantic template rules and screen the preliminary normalized data to obtain data to be normalized; and determine the synonym standard words and synonym secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized;
a data annotation module, configured to set a data annotation strategy, annotate the data to be normalized, and add manually constructed data for data augmentation to obtain training data;
a vector training module, configured to pre-train character vectors and word vectors with the preliminary normalized data and concatenate the character vectors and the word vectors in the vertical direction to obtain new vectors;
a preprocessing module, configured to preprocess the training data;
a model training module, configured to build a model with an Encoder-Decoder structure, train it on the preprocessed training data, and save the model with the best evaluation metrics (the optimal model);
a prediction module, configured to predict a sample to be predicted with the optimal model, using a beam search strategy and selecting the sequence with the highest probability as the output sequence to obtain the synonym word sequence of the sample to be predicted.
7. The device according to claim 6, wherein
the data set construction module crawls the public news data set and performs data cleaning on the news data set to obtain the cleaned data in the following way:
the data set construction module is specifically configured to crawl publicly available national, economic, and technology news data, remove special characters and meaningless symbols, and check for null values, an item being removed if it contains a null value, to obtain the cleaned data;
the data set construction module extracts the entity words from the cleaned data to obtain the preliminary normalized data in the following way:
the data set construction module is specifically configured to process the cleaned data with a pre-trained Chinese named entity recognition model, extract the company-name and organization-name entity words in each sentence as supplementary training corpus, and split or merge long and short sentences so that the character count of each sentence stays within a preset limit;
the data set construction module sets the semantic template rules and screens the preliminary normalized data to obtain the data to be normalized in the following way:
the data set construction module is specifically configured to perform data screening by setting the pattern rules "&lt;ORG&gt;+verb+&lt;ORG&gt; / &lt;COM&gt;+verb+&lt;ORG&gt; / &lt;ORG&gt;+verb+&lt;COM&gt; / &lt;COM&gt;+verb+&lt;COM&gt;" together with a manually constructed dictionary of high-frequency verbs in the field, to obtain the data to be normalized;
and/or
the data set construction module determines the synonym standard words and synonym secondary words in the data to be normalized, thereby identifying the synonym pairs in the data to be normalized, in the following way:
the data set construction module is specifically configured to take the entity word with the larger character count in a sentence of the data to be normalized as the synonym standard word; if the other entity word in the sentence belongs to the same class as the synonym standard word, every character of the other entity word is contained in the synonym standard word, and the character count of the other entity word is greater than 1, the other entity word is taken as the synonym secondary word, and the synonym standard word and the synonym secondary word in the sentence are determined to form a synonym pair.
8. The device according to claim 6, wherein
the data annotation module sets the data annotation strategy, annotates the data to be normalized, and adds the manually constructed data for data augmentation to obtain the training data in the following way:
the data annotation module is specifically configured to mark the first character of the synonym standard word in each sentence of the data to be normalized as E1 and the other characters of the synonym standard word as I1, mark the first character of the synonym secondary word as E2 and the other characters of the synonym secondary word as I2, and mark every other character in the sentence that does not belong to a synonym pair as O; the manually supplemented data that conform to the semantic template rules are added to the preliminary normalized data to obtain the training data, wherein the manually supplemented data are mixed with the data to be normalized by random shuffling;
the vector training module pre-trains the character vectors and the word vectors with the preliminary normalized data and concatenates the character vectors and the word vectors in the vertical direction to obtain the new vectors in the following way:
the vector training module is specifically configured to train the character vectors with a Word2vec model of the skip-gram structure, add the entity words to the dictionary formed after word segmentation and train the word vectors, and concatenate the word vectors with the character vectors in the vertical direction to obtain the new vectors;
and/or
the preprocessing module preprocesses the training data in the following way:
the preprocessing module is specifically configured to separate the training data into an annotation sequence and a Chinese sequence, perform stopword filtering on the Chinese sequence, build a dictionary, and encode the text sequence according to the dictionary indices.
9. The device according to claim 6, wherein the model training module builds the model with the Encoder-Decoder structure, trains it on the preprocessed training data, and saves the optimal model in the following way:
the model training module is specifically configured to use a model of the Encoder-Decoder structure, wherein in the encoder, convolutional neural networks with kernel sizes of 3, 4, and 5 extract sequence features, each feature stream is serialized through a bidirectional recurrent neural network, and self-attention is added to generate the corresponding attention weights, yielding the intermediate state values output by the encoder; at the decoder end, a 2-layer bidirectional recurrent neural network forms the decoder, into which the target sequence of the previous moment is fed at each step to interact with the intermediate state layer and generate the target sequence of the next time step.
10. The device according to claim 9, wherein the prediction module predicts the sample to be predicted with the optimal model in the following way:
the prediction module is specifically configured to use a beam search strategy with a beam size of 3 and select the sequence with the highest probability as the output sequence, obtaining the synonym word sequence of the sample to be predicted.
CN201910297022.1A 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device Active CN110020438B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910297022.1A CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910297022.1A CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Publications (2)

Publication Number Publication Date
CN110020438A true CN110020438A (en) 2019-07-16
CN110020438B (en) 2020-12-08

Family

ID=67191295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910297022.1A Active CN110020438B (en) 2019-04-15 2019-04-15 Sequence identification based enterprise or organization Chinese name entity disambiguation method and device

Country Status (1)

Country Link
CN (1) CN110020438B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008106473A1 (en) * 2007-02-26 2008-09-04 Microsoft Corporation Automatic disambiguation based on a reference resource
US8856119B2 (en) * 2009-02-27 2014-10-07 International Business Machines Corporation Holistic disambiguation for entity name spotting
CN104268200A (en) * 2013-09-22 2015-01-07 中科嘉速(北京)并行软件有限公司 Unsupervised named entity semantic disambiguation method based on deep learning
CN104111973A (en) * 2014-06-17 2014-10-22 中国科学院计算技术研究所 Scholar name duplication disambiguation method and system
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
CN108572960A (en) * 2017-03-08 2018-09-25 富士通株式会社 Place name disambiguation method and place name disambiguation device
CN107391613A (en) * 2017-07-04 2017-11-24 北京航空航天大学 A kind of automatic disambiguation method of more documents of industry security theme and device
CN107784125A (en) * 2017-11-24 2018-03-09 中国银行股份有限公司 A kind of entity relation extraction method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANGEL L. GARRIDO ET AL.: "NEREA: Named Entity Recognition and Disambiguation Exploiting Local Document Repositories", 2016 IEEE 28th International Conference on Tools with Artificial Intelligence *
GE Bin et al.: "A template-based unsupervised word sense disambiguation method", Computer Engineering &amp; Science *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220171924A1 (en) * 2019-03-25 2022-06-02 Nippon Telegraph And Telephone Corporation Index value giving apparatus, index value giving method and program
US11960836B2 (en) * 2019-03-25 2024-04-16 Nippon Telegraph And Telephone Corporation Index value giving apparatus, index value giving method and program
CN110516233B (en) * 2019-08-06 2023-08-01 深圳数联天下智能科技有限公司 Data processing method, device, terminal equipment and storage medium
CN110516233A (en) * 2019-08-06 2019-11-29 深圳和而泰家居在线网络科技有限公司 Data processing method, apparatus, terminal device and storage medium
CN111079418A (en) * 2019-11-06 2020-04-28 科大讯飞股份有限公司 Named entity recognition method and device, electronic equipment and storage medium
CN111079418B (en) * 2019-11-06 2023-12-05 科大讯飞股份有限公司 Named entity recognition method, device, electronic equipment and storage medium
US20220083733A1 (en) * 2019-12-05 2022-03-17 Boe Technology Group Co., Ltd. Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium
US11977838B2 (en) * 2019-12-05 2024-05-07 Boe Technology Group Co., Ltd. Synonym mining method, application method of synonym dictionary, medical synonym mining method, application method of medical synonym dictionary, synonym mining device and storage medium
CN111079435A (en) * 2019-12-09 2020-04-28 深圳追一科技有限公司 Named entity disambiguation method, device, equipment and storage medium
CN111259087A (en) * 2020-01-10 2020-06-09 中国科学院软件研究所 Computer network protocol entity linking method and system based on domain knowledge base
CN111259087B (en) * 2020-01-10 2022-10-14 中国科学院软件研究所 Computer network protocol entity linking method and system based on domain knowledge base
CN111339319A (en) * 2020-03-02 2020-06-26 北京百度网讯科技有限公司 Disambiguation method and device for enterprise name, electronic equipment and storage medium
CN111339319B (en) * 2020-03-02 2023-08-04 北京百度网讯科技有限公司 Enterprise name disambiguation method and device, electronic equipment and storage medium
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
CN111581335A (en) * 2020-05-14 2020-08-25 腾讯科技(深圳)有限公司 Text representation method and device
WO2021139257A1 (en) * 2020-06-24 2021-07-15 平安科技(深圳)有限公司 Method and apparatus for selecting annotated data, and computer device and storage medium
CN111814479A (en) * 2020-07-09 2020-10-23 上海明略人工智能(集团)有限公司 Enterprise short form generation and model training method and device
CN111814479B (en) * 2020-07-09 2023-08-25 上海明略人工智能(集团)有限公司 Method and device for generating enterprise abbreviations and training model thereof
CN112069826B (en) * 2020-07-15 2021-12-07 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112069826A (en) * 2020-07-15 2020-12-11 浙江工业大学 Vertical domain entity disambiguation method fusing topic model and convolutional neural network
CN112017643B (en) * 2020-08-24 2023-10-31 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and related device
CN112017643A (en) * 2020-08-24 2020-12-01 广州市百果园信息技术有限公司 Speech recognition model training method, speech recognition method and related device
CN111737407A (en) * 2020-08-25 2020-10-02 成都数联铭品科技有限公司 Event unique ID construction method based on event disambiguation
CN113326380A (en) * 2021-08-03 2021-08-31 国能大渡河大数据服务有限公司 Equipment measurement data processing method, system and terminal based on deep neural network
CN113761942A (en) * 2021-09-14 2021-12-07 合众新能源汽车有限公司 Semantic analysis method and device based on deep learning model and storage medium
CN113761942B (en) * 2021-09-14 2023-12-05 合众新能源汽车股份有限公司 Semantic analysis method, device and storage medium based on deep learning model
CN113609825A (en) * 2021-10-11 2021-11-05 北京百炼智能科技有限公司 Intelligent customer attribute tag identification method and device
CN114398492A (en) * 2021-12-24 2022-04-26 森纵艾数(北京)科技有限公司 Knowledge graph construction method, terminal and medium in digital field
CN114398492B (en) * 2021-12-24 2022-08-30 森纵艾数(北京)科技有限公司 Knowledge graph construction method, terminal and medium in digital field

Also Published As

Publication number Publication date
CN110020438B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
WO2022141878A1 (en) End-to-end language model pretraining method and system, and device and storage medium
CN110209822A Learning-domain data dependency prediction method based on deep learning, and computer
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN114282592A (en) Deep learning-based industry text matching model method and device
CN114398900A (en) Long text semantic similarity calculation method based on RoBERTA model
CN113869054A (en) Deep learning-based electric power field project feature identification method
Xue et al. A method of Chinese tourism named entity recognition based on BBLC model
Xu et al. Short text classification of Chinese with label information assisting
Zhao et al. Chinese named entity recognition in power domain based on Bi-LSTM-CRF
CN109117471A Word relatedness calculation method and terminal
CN116680407A (en) Knowledge graph construction method and device
CN114548090B (en) Fast relation extraction method based on convolutional neural network and improved cascade labeling
Wang et al. Predicting the Chinese poetry prosodic based on a developed BERT model
CN112613316B (en) Method and system for generating ancient Chinese labeling model
CN114238649A (en) Common sense concept enhanced language model pre-training method
Xu et al. Causal event extraction using causal event element-oriented neural network
Wu A computational neural network model for college English grammar correction
Yao et al. Heterogeneous Graph Neural Network for Chinese Financial Event Extraction
Ding et al. Graph structure-aware bi-directional graph convolution model for semantic role labeling
Chen Automatic Assessment Method of Oral English Based on Multimodality
Liu et al. Text Analysis of Community Governance Case based on Entity and Relation Extraction
CN115114915B (en) Phrase identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant