CN107908642A

CN107908642A - Industry text entities extracting method based on distributed platform

Info

Publication number: CN107908642A
Application number: CN201710902720.0A
Authority: CN
Inventors: 武克杰; 周书勇
Original assignee: Jiangsu Huatong Sheng Yun Technology Co Ltd
Current assignee: Jiangsu Huatong Sheng Yun Technology Co Ltd
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2018-04-13
Anticipated expiration: 2037-09-29
Also published as: CN107908642B

Abstract

The invention discloses a kind of industry text entities extracting method based on distributed platform, including：Relationship characteristic model is obtained using deep learning neural metwork training text data set；The relationship characteristic of extraction is generated into multiple elasticity distribution formula relationship characteristic data set RDD；The category feature model extraction category feature that data set in RDD is obtained by improved non-linear svm classifier Algorithm for Training；Corresponding linguistic context physical model is found according to the category feature of extraction, and the solid data in the text for corresponding to category feature is extracted by trained physical model；Judge whether this quantity of corresponding linguistic context text exceedes given threshold, if exceed threshold value, the re -training linguistic context physical model, utilizes the solid data in the text of the corresponding category feature of physical model extraction of re -training, otherwise, text entities feature and text data are preserved.The text feature entity under different context can be handled, effectively increases the efficiency and extraction entity accuracy rate of entity extraction.

Description

Industry text entities extracting method based on distributed platform

Technical field

The present invention relates to a kind of extracting method of text entities, more particularly to a kind of industry text based on distributed platform This entity extraction method.

Background technology

Traditional Text Extraction is using pattern match Relation extraction method, the Relation extraction based on dictionary driving, base In Relation extraction method of machine learning etc., these first most of methods are that word frequency is higher in the method extraction text by participle Word as effective entity.These methods are suitable for the relatively simple scene of entity in text, but under different context, these Method cannot effectively distinguish entity under different context, need not will can split originally or the segmentation and conjunction of merged entity mistake And.

Meanwhile word of the traditional detection method to the mistake for not having to occur in former text, it is difficult to be carried out by segmenting method Extraction.

Occur many extraction instance methods based on deep learning in the recent period, wherein extraction entity algorithm is divided into calculated performance It is not higher that relatively good but extraction is accurate, two kinds of models that extraction accuracy is higher but calculated performance is slow.Such as fast linear Entity extraction model, convolutional neural networks are exactly accelerated model, and non-linear entity extraction model, deep neural network model are exactly The relatively good model of accuracy.

It is real that Chinese patent literature CN2017100036859 discloses a kind of online traditional Chinese medical science text name based on deep learning Body recognition methods, the entity extraction method are carried by reptile rich text training sample set, while using the method for neutral net Text feature is taken, this can extract the accuracy of the entity of sample to a certain extent, but with the increase pair of training sample The extraction physical model answered also increases, while the time of training can gradually increase, while extracts the characteristic time also with increase.

The content of the invention

For above-mentioned technical problem, the present invention seeks to：A kind of industry text entities based on distributed platform are provided to carry Method is taken, using multiple elasticity distribution formula entity extraction models in Spark platforms, the text feature handled under different context is real Body, so can effectively improve the efficiency of entity extraction, can also improve extraction entity accuracy rate.At the same time by supporting vector Weights are improved in machine sorting algorithm, enhance the generalization ability of text, the further accuracy of text.

The technical scheme is that：

A kind of industry text entities extracting method based on distributed platform, comprises the following steps：

S01：Relationship characteristic model is obtained using deep learning neural metwork training text data set, and passes through relationship characteristic Relationship characteristic in model extraction target text；

S02：The relationship characteristic of extraction is generated into multiple elasticity distribution formula relationship characteristic data set RDD；

S03：The category feature model that data set in RDD is obtained by improved non-linear svm classifier Algorithm for Training Extract category feature；

S04：Corresponding linguistic context physical model is found according to the category feature of extraction, and is extracted by trained physical model Solid data in the text of corresponding category feature；

S05：Judge whether this quantity of corresponding linguistic context text exceedes given threshold T, if exceed threshold value T, re -training should Linguistic context physical model, using the solid data in the text of the corresponding category feature of physical model extraction of re -training, otherwise, is protected Deposit text entities feature and text data.

Preferably, the step S01 is specifically included：

S11：Text is segmented by ansj segmenting methods of increasing income, count word frequency of each word in all texts and Word frequency in current text, removes general auxiliary word, stop words and the high word of frequency, by all texts according to ought be above The relation of word frequency in this and the word frequency in all texts, extracts N number of word, will be placed on per one kind in same file folder；

S12：Each word in N number of word is randomly set to the data characteristics of A dimensions, each text forms N*A dimension datas；

S13：Using each word feature as deep learning neutral net input node neuron, then pass through the first hidden layer Convolution is carried out, sub-sample and local average are carried out by the second hidden layer, second of convolution is carried out by the 3rd hidden layer, is led to Cross the 4th hidden layer and carry out second of sub-sample and the calculation of local average juice, full articulamentum, convert text to B dimension datas, lead to Multiple testing and debugging accuracy is crossed, obtains relationship characteristic model.

Preferably, the step S03 is specifically included：

S31：The weight and offset in non-linear svm classifier algorithm are adjusted, makes the relationship characteristic of input and has marked The error of the feature of sample preserves the category feature model of text in setting range；

S32：The disaggregated model method of selection is improved non-linear svm classifier algorithm, its training pattern class object letter Number isWhereinForecast classification condition is y=w' φ (x_i)+ b+ε_i, obtain discriminant functionWherein weightsC is penalty factor, is one Empirical parameter, i are RDD numbers, and w is vectorial weight, s_iIt is the Euclidean distance of positive sample and negative sample in relationship characteristic, b is point Threshold value during class, ε_iFor error, φ (x_i) it is Non-linear Kernel function；

S32：Gradually adjustment penalty factor, test select optimal penalty factor, wherein Non-linear Kernel function phi (x_i) For min (x (i), x_s(i)), wherein x (i), x_s(i) it is feature vector that any two text relationship characteristic sample extraction arrives；Often The label of class relationship characteristic sample is corresponding classification number, and the α of discriminant function is obtained by multiple off-line training_iAnd b, wherein sentencing Other functionIt is exactly corresponding category feature model.

Preferably, in the step S03, bad and sample text that is having apparent error will be extracted and be put into new class, by Step section test sample so that test sample class is optimal.

Compared with prior art, it is an advantage of the invention that：

Present invention improves over sorting algorithm model, wherein mainly with the addition of punishment in training pattern class object function The weighting coefficient of the factor, enhances the generalization ability of train classification models, while employs Non-linear Kernel function min (x (i), x_s (i)) so that the correspondence classification of text can accurately be found.Pass through distributed spark platforms Text Feature Extraction physical model point at the same time Into the extraction text entities model of multiple scenes, solve tradition extraction text entities training and computational load is bigger asks Inscribe, entity in the energy each text of rapid extraction, can more accurately extract text entities.

Brief description of the drawings

The invention will be further described with reference to the accompanying drawings and embodiments：

Fig. 1 is the flow chart of the industry text entities extracting method of the invention based on distributed platform.

Embodiment

To make the object, technical solutions and advantages of the present invention of greater clarity, with reference to embodiment and join According to attached drawing, the present invention is described in more detail.It should be understood that these descriptions are merely illustrative, and it is not intended to limit this hair Bright scope.In addition, in the following description, the description to known features and technology is eliminated, to avoid this is unnecessarily obscured The concept of invention.

Embodiment：

As shown in fig. 1, the industry text entities extracting method based on distributed platform, comprises the following steps：

(1) in text collection, the textual data of every profession and trade is obtained respectively by akka communication modules in spark Open Source Platforms It is believed that breath, the text data for needing to extract entity of monitoring device collection is transmitted on distributed Spark platforms.

(2) spark platform clusters are built, a wherein server is saved as management node, 4 servers as service Point.Dependence wherein between management node essential record data flow is simultaneously responsible for task scheduling and the new RDD of generation.Service Node is mainly the store function for realizing parser and data.

(3) existing text data set is trained to obtain relationship characteristic model, Ran Houli by deep learning neural net method With the relationship characteristic in the new text of relationship characteristic model extraction；

The generation of relationship characteristic model, specifically includes：

S21：First text is segmented with ansj segmenting methods of increasing income, each word is then calculated in institute by statistical There are the word frequency in text and the word frequency in current text, remove general auxiliary word, stop words, and the word that frequency is higher Language, then the word frequency relation in the word frequency in current text and all texts, extracts N number of primary word, while will be each Class is put into same file folder.

S22：Then the data characteristics that each vocabulary is 200 dimensions is randomly provided, so each samples of text can form N* The data of 200 dimensions.

S23：Using the relationship characteristic of each word as deep learning neutral net input node neuron, then pass through first Hidden layer carries out convolution, the second hidden layer carries out sub-sample and local average, the 3rd hidden layer carry out second of convolution, the 4th A hidden layer carries out second of sub-sample and the calculation of local average juice, full articulamentum, realizes that N*200 dimension datas turn to 1000 dimension datas Change.70% data wherein are used to training and 30% to be used to test.Gradual adjusting training is adjusted by multiple test accuracy The model of depth network generation, it is exactly the relational model for generating text that can finally be optimal network model.

(4) the relationship characteristic text data extracted is converted into text elasticity distribution formula RDD relationship characteristic text datas, Then it is divided into multiple RDD according to text contextual feature stream and carrys out burst processing.

(5) the elasticity distribution formula RDD features text data that will convert into passes through improved non-linear svm classifier Algorithm for Training For category feature model conversion out into category feature, trained data set is existing and categorized good industry text data Collection, while the advantage that quickly can be quickly calculated using spark distributed platforms, can to correcting industry text data set again With Fast Training, new category feature model is obtained.

Bad and sample text that is having apparent error will be extracted to be put into new class, progressively adjust test sample so that Test sample class is optimal；New text set can form different classifications, be feature by the distribution of spark platforms, can be fast All samples are extracted corresponding entity by speed by the physical model of corresponding types.It is corresponding more with increasing for classification Class physical model robustness is stronger, and it is better to extract entity accuracy.

The category feature model that improved non-linear svm classifier Algorithm for Training comes out, comprises the following steps：

Choose improved supporting vector machine model is as train classification models, its training pattern class object functionIts corresponding constraints is y=w' φ (x_i)+b+ε_i, pass through object function peace treaty Beam condition derives discriminant functionWherein weightsC is penalty factor, is one A adjustable parameter, i are 1 to arrive n training text number of samples, and w is weight vector, s_iIt is the Euclidean distance of positive sample and negative sample, And as the weighting coefficient of penalty factor in object function, b is threshold value, ε_iFor error, φ (x_i) it is Non-linear Kernel function；

Between penalty factor is set as 1 to 100, feature extraction is carried out to the positive negative sample of preprepared pedestrian, it is right Kernel function φ (the x answered_i) it is min (x (i), x_s(i)), wherein x (i), x_s(i) it is feature that the positive and negative sample extraction of any two arrives Vector；The label of positive sample is that value is 1, and negative sample label value is -1, and off-line training obtains the α of discriminant function_iAnd b, wherein sentencing Other functionIt is exactly corresponding non-linear SVM detection models；

By the result y for judging detection model_i, export the classification that respective value corresponds to text linguistic context.

(6) corresponding linguistic context physical model is found according to the category feature of text, and is extracted by trained physical model The solid data in the text of corresponding types is selected, wherein linguistic context physical model is existing by word2vec instruments of increasing income The physical model for the text that industry text data set is trained；

(7) when some scene text quantity exceedes threshold value T, the word2vec instruments re -training scenario entities will be used Model, can first save the data on distributed platform when no more than number of thresholds, wherein more than 10,000 samples of general T Quantity.

It should be appreciated that the above-mentioned embodiment of the present invention is used only for exemplary illustration or explains the present invention's Principle, without being construed as limiting the invention.Therefore, that is done without departing from the spirit and scope of the present invention is any Modification, equivalent substitution, improvement etc., should all be included in the protection scope of the present invention.In addition, appended claims purport of the present invention Covering the whole changes fallen into scope and border or this scope and the equivalents on border and repairing Change example.

Claims

1. a kind of industry text entities extracting method based on distributed platform, it is characterised in that comprise the following steps：

S01：Relationship characteristic model is obtained using deep learning neural metwork training text data set, and passes through relationship characteristic model Extract the relationship characteristic in target text；

S03：The category feature model extraction that data set in RDD is obtained by improved non-linear svm classifier Algorithm for Training Category feature；

S04：Corresponding linguistic context physical model is found according to the category feature of extraction, and is extracted and corresponded to by trained physical model Solid data in the text of category feature；

S05：Judge whether this quantity of corresponding linguistic context text exceedes given threshold T, if exceed threshold value T, the re -training linguistic context Physical model, using the solid data in the text of the corresponding category feature of physical model extraction of re -training, otherwise, preserves text This substance feature and text data.

2. the industry text entities extracting method according to claim 1 based on distributed platform, it is characterised in that described Step S01 is specifically included：

S11：Text is segmented by ansj segmenting methods of increasing income, word frequency of each word in all texts is counted and is working as Word frequency in preceding text, removes general auxiliary word, stop words and the high word of frequency, by all texts according in current text Word frequency and the word frequency in all texts relation, extract N number of word, by per one kind be placed on same file folder in；

S13：Using each word feature as deep learning neutral net input node neuron, then carried out by the first hidden layer Convolution, sub-sample and local average are carried out by the second hidden layer, and second of convolution is carried out by the 3rd hidden layer, by the Four hidden layers carry out second of sub-sample and the calculation of local average juice, full articulamentum, B dimension datas are converted text to, by more Secondary testing and debugging accuracy, obtains relationship characteristic model.

3. the industry text entities extracting method according to claim 1 based on distributed platform, it is characterised in that described Step S03 is specifically included：

S31：Adjust the weight and offset in non-linear svm classifier algorithm, the sample for making the relationship characteristic of input and having marked Feature error in setting range, preserve the category feature model of text；

S32：The disaggregated model method of selection is improved non-linear svm classifier algorithm, its training pattern class object function isWhereinForecast classification condition is y=w' φ (x_i)+b+ ε_i, obtain discriminant functionWherein weightsC is penalty factor, is a warp Test parameter, i is RDD numbers, and w is vectorial weight, s_iIt is the Euclidean distance of positive sample and negative sample in relationship characteristic, b is classification When threshold value, ε_iFor error, φ (x_i) it is Non-linear Kernel function；

S32：Gradually adjustment penalty factor, test select optimal penalty factor, wherein Non-linear Kernel function phi (x_i) it is min (x(i),x_s(i)), wherein x (i), x_s(i) it is feature vector that any two text relationship characteristic sample extraction arrives；Per class relation The label of feature samples is corresponding classification number, and the α of discriminant function is obtained by multiple off-line training_iAnd b, wherein discriminant functionIt is exactly corresponding category feature model.

4. the industry text entities extracting method according to claim 1 based on distributed platform, it is characterised in that described In step S03, bad and sample text that is having apparent error will be extracted and be put into new class, progressively adjust test sample so that Test sample class is optimal.