CN107943911A - Data extraction method, apparatus, computer device and readable storage medium - Google Patents

Data extraction method, apparatus, computer device and readable storage medium

Info

Publication number
CN107943911A
CN107943911A (application CN201711155534.1A)
Authority
CN
China
Prior art keywords
data
feature tag
neural network
network model
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711155534.1A
Other languages
Chinese (zh)
Inventor
王昕�
张剑
黄石磊
丁芳桂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
SHENZHEN PRESS GROUP
Peking University Shenzhen Graduate School
Original Assignee
PKU-HKUST SHENZHEN-HONGKONG INSTITUTION
SHENZHEN PRESS GROUP
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PKU-HKUST SHENZHEN-HONGKONG INSTITUTION, SHENZHEN PRESS GROUP, Peking University Shenzhen Graduate School
Priority to CN201711155534.1A
Publication of CN107943911A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24564 Applying rules; Deductive queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a data extraction method, apparatus, computer device and readable storage medium. The data extraction method includes: splitting data to be processed to obtain data sets; inputting the data sets into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data; extracting corresponding target extraction data from the initial extraction data according to a preset rule template; and outputting the target extraction data in association with the feature tags corresponding to the target extraction data. With the above data extraction method, the extraction of data in different formats is freed from the limitation of extraction rules: the mapping between data sets and feature tags undergoes data extraction through customized extraction rules, which can reduce the error rate when extracting data in different formats and yields a better extraction effect.

Description

Data extraction method, apparatus, computer device and readable storage medium
Technical field
The present invention relates to the field of computers, and more particularly to a data extraction method, apparatus, computer device and readable storage medium.
Background technology
With the rapid development of modern information and storage technologies and the rapid spread of the Internet, people encounter all kinds of information on the network in daily life. In the big-data era, what people lack is not information, but the ability to obtain useful information of interest from a mass of complicated and miscellaneous information. The advantage of data extraction technology is that it simplifies the natural language processing procedure by focusing only on relevant information and ignoring irrelevant content.
Traditional data extraction methods rely mainly on rule-based extraction: the information words of interest are identified and located, and extraction rules are then customized according to linguistic features and the relevant formatted data. The customized rules can only target data in certain specific formats, and when faced with data in different formats, segmentation errors and the singularity of the extraction rules often make the error rate of the data extraction very high.
Summary of the invention
Based on this, in view of the problem that traditional data extraction methods have a high error rate, it is necessary to provide a data extraction method, apparatus, computer device and readable storage medium.
A data extraction method, including:
splitting data to be processed to obtain data sets;
inputting the data sets into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data;
extracting corresponding target extraction data from the initial extraction data according to a preset rule template; and
outputting the target extraction data in association with the feature tags corresponding to the target extraction data.
In one of the embodiments, the step of splitting the data to be processed to obtain the data sets includes:
splitting the data to be processed according to punctuation marks to obtain the data sets.
In one of the embodiments, the step of inputting the data sets into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data includes:
inputting the data sets into the preset neural network model to obtain candidate tags and candidate data sets corresponding to the candidate tags;
obtaining the probability of each candidate tag corresponding to a data set; and
selecting the candidate tag with the highest probability as the feature tag, with the candidate data set corresponding to the feature tag serving as the initial extraction data set.
In one of the embodiments, after the step of outputting the feature tags in association with the target extraction data, the method further includes:
when there is an error in the feature tags and target extraction data of the associated output, receiving an adjustment instruction for the preset rule template; and
adjusting the preset rule template according to the adjustment instruction.
In one of the embodiments, the method further includes:
preprocessing sample data according to a preprocessing rule to obtain sample sets;
obtaining the feature tag corresponding to each sample set; and
inputting the sample sets and the feature tags into an initial neural network model to obtain the preset neural network model.
In one of the embodiments, the step of inputting the sample sets and the feature tags into the initial neural network model to obtain the preset neural network model includes:
dividing the sample sets into a training set and a validation set;
inputting the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model;
inputting the validation set into the trained neural network model to obtain validation feature tags; and
when a validation feature tag is inconsistent with the feature tag corresponding to the training set, correcting the trained neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.
In one of the embodiments, the step of preprocessing the sample data according to the preprocessing rule to obtain the sample sets includes:
segmenting the samples into a character set according to preset word-segmentation logic;
representing each character in the character set as a character vector according to a preset vector model and the number of characters in the character set;
representing the characters in the character set as word sequences according to a preset rule; and
obtaining the sample sets from the character vectors and the word sequences.
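The character-level preprocessing above can be sketched in Python. The one-hot encoding below stands in for the unspecified preset vector model (a real system might instead use learned embeddings), so the encoding choice is an assumption, not the patent's.

```python
def build_char_vectors(samples):
    """Represent each character of each sample as a vector.

    One-hot encoding is a stand-in for the preset vector model,
    which the text does not specify; the vector length equals the
    number of distinct characters, echoing the 'number of characters
    in the character set' mentioned in the claim.
    """
    vocab = sorted({ch for s in samples for ch in s})
    index = {ch: i for i, ch in enumerate(vocab)}

    def vec(ch):
        v = [0] * len(vocab)
        v[index[ch]] = 1
        return v

    # Each sample becomes a word sequence of character vectors.
    return vocab, [[vec(ch) for ch in s] for s in samples]

vocab, sample_sets = build_char_vectors(["ab", "ba"])
print(vocab, sample_sets[0])  # → ['a', 'b'] [[1, 0], [0, 1]]
```

The sample sets produced this way are what would be fed, together with their feature tags, into the initial neural network model.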
A data extraction apparatus, including:
a splitting module, for splitting data to be processed to obtain data sets;
a tagging module, for inputting the data sets into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data;
an extraction module, for extracting corresponding target extraction data from the initial extraction data according to a preset rule template; and
an output module, for outputting the target extraction data in association with the feature tags corresponding to the target extraction data.
A computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the program.
A readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the above method.
With the above data extraction method, apparatus, computer device and readable storage medium, the data to be processed is first split; feature tags are then added to the split data sets through the self-learning ability of the neural network model, the target extraction data contained in the data sets is extracted through the rule template, and finally the target extraction data is output with its corresponding feature tags. When extracting data in different formats, as long as the data to be extracted can be recognized by a computer, the mapping between data sets and feature tags can be established through the neural network model, free from the limitation of extraction rules; the mapping between data sets and feature tags then undergoes data extraction through the customized extraction rules, which can reduce the error rate when extracting data in different formats and yields a better extraction effect.
Brief description of the drawings
Fig. 1 is a flow chart of the data extraction method in one embodiment;
Fig. 2 is a flow chart of step S104 in the embodiment shown in Fig. 1;
Fig. 3 is a flow chart of the preprocessing step in one embodiment;
Fig. 4 is a flow chart of step S302 in the embodiment shown in Fig. 3;
Fig. 5 is a flow chart of step S104 in the embodiment shown in Fig. 1;
Fig. 6 is a structural diagram of the data extraction apparatus in one embodiment;
Fig. 7 is a structural diagram of the computer device in one embodiment.
Detailed description of the embodiments
In order to make the purposes, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and are not intended to limit it.
Before describing embodiments of the present invention in detail, it should be noted that the embodiments essentially consist in combinations of method steps and system components related to the data extraction method, apparatus, computer device and readable storage medium. Accordingly, the system components and method steps are represented by conventional symbols at appropriate positions in the accompanying drawings, showing only the details relevant to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be obvious to those of ordinary skill in the art having the benefit of this description.
Herein, relational terms such as left and right, upper and lower, front and rear, and first and second are used merely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device.
Referring to Fig. 1, which provides a flow chart of the data extraction method in one embodiment, the method includes the following steps:
S102: splitting data to be processed to obtain data sets.
Here, the data to be processed is the target data of the data extraction to be performed, including text data, PDFs, pictures, etc.; for example, a resume file. A data set is data that can be input into the preset neural network model; it may be a character set, a picture set, or a combination of text and pictures, etc.
Specifically, the data to be processed is split to obtain data sets, i.e., the target data is segmented according to predetermined logic into data sets that can be input into the preset neural network model. Splitting the data prevents the raw data from being fed into the neural network model all at once, which would cause data congestion and low processing efficiency; splitting according to preset rules also keeps the content of each resulting data set internally related, which facilitates the next step of data processing by the neural network model.
For example, in a resume data extraction, the resume content to be extracted is "Li Ming graduated from Tsinghua University in 2000, has worked at a certain company on XX since 2001, and the company once won an XX award." The resume to be extracted is first split into 3 data sets, namely data set 1: Li Ming graduated from the automation major of Tsinghua University in 2000; data set 2: has worked at a certain company on XX since 2001; data set 3: the company once won an XX award.
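The splitting of this example can be sketched as follows; the exact punctuation set and the dropping of empty fragments are assumptions, since the text only states that splitting can follow punctuation marks.

```python
import re

def split_to_datasets(text):
    """Split raw text into data sets at Chinese/Western punctuation (step S102).

    The punctuation character class below is an assumption; the
    patent only says that splitting may follow punctuation marks.
    """
    parts = re.split(r"[，。；！？,.;!?]", text)
    return [p.strip() for p in parts if p.strip()]

resume = "李明2000年毕业于清华大学自动化专业，2001年至今在某公司从事XX工作，该公司曾获XX奖励。"
print(len(split_to_datasets(resume)))  # → 3
```

Applied to the example resume, this yields the same 3 data sets described above.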
S104: inputting the data sets into the preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data.
Here, the preset neural network model is a neural network model with fixed processing rules obtained through sample training. The initial extraction data is the data obtained by inputting the data sets split in step S102 into the preset neural network model, and a feature tag is a mark that the preset neural network model attaches to the data in a data set.
Specifically, after the data sets are input into the preset neural network model, the model identifies, through its trained rules, the data sets relevant to the target data to be extracted, attaches feature tags to the data relevant to the extraction target, discards the data sets without feature tags, and retains only the data sets to which feature tags have been added, thereby obtaining the initial extraction data.
For example, in the resume extraction above, the preset neural network model has learned during training how to recognize data relevant to the target extraction data and how to attach feature tags to it. For instance, for the feature tag "name", when the preset neural network model reads surname characters such as Li or Wang, it can recognize that these characters may represent a surname, and according to the logic of Chinese, the content after the surname is most likely a personal name; for the feature tag "graduate school", when the model reads data such as Tsinghua University or Wuhan University, it can recognize that these may represent the school the person graduated from. After the 3 data sets above are input into the preset neural network model, the feature tag name is attached to "Li Ming" in data set 1, the feature tag graduate school to "Tsinghua University", and the feature tag major to "automation major"; the feature tag employer is attached to "a certain company" in data set 2. Data set 1 thus receives 3 feature tags, data set 2 receives 1 feature tag, and data set 3 receives none. After the preset neural network model, data sets 1 and 2 are output as the initial target extraction data sets, the feature tags added to data sets 1 and 2 are the feature tags corresponding to the initial extraction data, and data set 3, to which no feature tag was added, is discarded.
S106: extracting corresponding target extraction data from the initial extraction data according to the preset rule template.
Here, the preset rule template is a rule template hand-coded from a summary of the sample data, combining experience and data characteristics. For example: Mr. Lin Fei, Chinese nationality, no permanent right of residence overseas, born in 1968. "Chinese nationality" is located by the nationality rule as pers.country, "Mr." locates the person as male pers.male, "Lin Fei" is located as the name pers.name, and "born in 1968" locates 1968 as the birthday pers.birth. After the fine-grained rules are formulated, the matching process for the information in a text segment is carried out as follows.
Specifically, extracting the target extraction data from the initial extraction data according to the preset rule template is a process of rule-based extraction. Rule-based extraction formulates a template from information such as keywords, word positioning, string matching, regular expressions and entity information, and uses it to extract the key information. The target extraction data is the extraction target obtained after rule-based extraction, i.e., the extraction goal of this data extraction method.
First, the resume text is obtained and split according to punctuation marks, outputting several text segments P = {p_1, p_2 ... p_n}. Next, it is judged whether the information elements contained in each text segment of P = {p_1, p_2 ... p_n} include the information elements to be extracted from the resume information, and the text segments containing those information elements are collected to form a new set of text segments P2 = {pr_1, pr_2 ... pr_n}. Finally, the text segments P2 = {pr_1, pr_2 ... pr_n} and the corresponding information-element tag data pairs are used as training data to train the corresponding information-element extraction rules, and the information is stored.
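A regex-based rule template in the spirit of the Lin Fei example might look like the sketch below. The tag names (pers.name, pers.country, pers.male, pers.birth) come from the text, while the concrete patterns are illustrative assumptions, not the patent's rules.

```python
import re

# Hypothetical fine-grained rules; the patterns are assumptions.
RULES = {
    "pers.name":    re.compile(r"^([\u4e00-\u9fa5]{2,4})(?:先生|女士)"),
    "pers.male":    re.compile(r"(先生)"),
    "pers.country": re.compile(r"(中国)国籍"),
    "pers.birth":   re.compile(r"(\d{4})年出生"),
}

def apply_rules(segment):
    """Run every rule of the template over one text segment and
    collect tag/value pairs, as in the matching process above."""
    out = {}
    for tag, pattern in RULES.items():
        match = pattern.search(segment)
        if match:
            out[tag] = match.group(1)
    return out

seg = "林飞先生，中国国籍，无境外永久居留权，1968年出生。"
print(apply_rules(seg))
```

Running this on the example segment yields the four tag/value pairs described in the text (name, gender, nationality, birth year).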
S108: outputting the target extraction data in association with the feature tags corresponding to the target extraction data.
Specifically, a feature tag and its target extraction data form a data pair, and the final output of this data extraction method is in fact a group of such data pairs. After the feature tags and the extracted target extraction data are output as pairs, the parsed data can be used to build an information database, which is undoubtedly very helpful for subsequent applications such as data mining and recommendation systems.
If the preset rule template in this resume data extraction is a template for extracting name, graduate school, major and employer, then according to this preset rule template, when extracting the name, graduate school and major, the extraction target is located in data set 1 according to the feature tags, and Li Ming, Tsinghua University and automation major are extracted from data set 1 according to the preset rule template; similarly, when extracting the employer, the extraction target is located in data set 2 according to the feature tags, and a certain company is extracted from data set 2 according to the preset rule template. Finally, the feature tags are output in association with the extracted target extraction data, i.e.: name-Li Ming, graduate school-Tsinghua University, major-automation major, employer-a certain company.
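The associated output of S108 can be illustrated as pairing tags with their extracted values; representing each tagged data set as a dict is an assumption about the shape of the upstream tagging step, not something the text fixes.

```python
def associate(tagged_datasets, template_tags):
    """Pair each feature tag selected by the rule template with its
    extracted value, producing the tag-value pairs output in S108."""
    pairs = []
    for tags in tagged_datasets:
        for tag, value in tags.items():
            if tag in template_tags:
                pairs.append((tag, value))
    return pairs

dataset1 = {"name": "李明", "graduate school": "清华大学", "major": "自动化专业"}
dataset2 = {"employer": "某公司"}
wanted = {"name", "graduate school", "major", "employer"}
print(associate([dataset1, dataset2], wanted))
```

The resulting list of pairs mirrors the "name-Li Ming, graduate school-Tsinghua University, ..." output of the example.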
With the above data extraction method, when data extraction is performed, the data to be processed is first split into data sets according to the splitting rules. Regardless of whether the data to be extracted shares the same format or the same encoding, as long as the preset neural network model is trained and the data to be processed can be recognized by a computer, the preset neural network model can establish a mapping between the content of the data to be extracted and the feature tags. This overcomes the problem of traditional rule-based extraction, in which the customized rules can only target resumes of certain specific formats and prove inadequate in the face of massive and varied resume texts, requiring constant addition, modification and maintenance of the existing rules as well as handling of conflicts between rules. Although a neural network has a powerful self-learning ability to automatically learn the interrelated features between the resume information elements in a text, its learning process carries a certain error rate. In the present application, the use of neural network technology combined with extraction rules lets the rule template module judge the resume information elements contained in the data and then accurately extract the information elements from the resume, which greatly reduces the error rate of neural network learning, so that the extraction is more accurate and the extraction effect is better.
In one of the embodiments, when extracting data, the data can be split according to punctuation marks, according to paragraphs (i.e., line breaks), or according to different colors or font sizes of the text, etc. Specifically, the data to be processed is split into data sets according to punctuation marks, because punctuation marks are easier to recognize and conform to the usage rules of textual data.
In one of the embodiments, after the step S108 of outputting the feature tags in association with the target extraction data in the above method, the method further includes: when there is an error in the feature tags and target extraction data of the associated output, receiving an adjustment instruction for the preset rule template; and adjusting the preset rule template according to the adjustment instruction.
Here, the adjustment instruction is an instruction to modify the preset rule template according to the characteristics of the output error when the output feature tags and target extraction data contain an error.
Specifically, the output of the neural network template may differ with different data to be processed, and the parts of the output with a higher error rate, together with the corresponding parts of the preset rule template, can be judged to be general problems in actual use. Therefore, the error rate of each part of the feature tags and target extraction data output by the neural network template can be collected statistically, and the parts of the preset rule template corresponding to the high-error parts of the output can then be adjusted manually. For example, whether there is an error between the feature tags and target extraction data of the associated output can be recognized manually or automatically; with automatic recognition, the erroneous parts can be marked, e.g., by changing the font color or format, and the user then corrects the error and adapts the preset rule template accordingly, so that the same error will not occur again next time. The more the system is used, the higher its accuracy becomes, which benefits subsequent efficiency.
In the rule-based extraction process of the above data extraction method, since the extraction rules formulate a template from data such as keywords, word positioning, string matching, regular expressions and entity information to extract the key data, this template will not necessarily satisfy all data extraction requirements. When there is an error in the output feature tags and target extraction data, some expert intervention and verification is needed to ensure the accuracy of the rules, i.e., an expert modifies the high-error parts of the preset rule template to ensure the accuracy of the data extraction.
Referring to Fig. 2, in one embodiment, step S104 of the above method, i.e., the step of inputting the data sets into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data, may include:
S202: inputting the data sets into the preset neural network model to obtain candidate data sets and candidate tags corresponding to the candidate data sets.
Here, a candidate tag is a feature tag that, according to the rules the preset neural network model has learned during training, may correspond to the data in an input data set.
A candidate data set is the initial extraction data obtained through the preset neural network model according to a candidate tag; a candidate tag is not necessarily the final feature tag.
Specifically, when this step is performed, the data sets input into the preset neural network model are only marked, or uniformly reformatted, and data sets without a candidate tag need not be discarded, which prevents data loss caused by an erroneous candidate tag.
S204: obtaining the probability of each candidate tag corresponding to a data set.
Specifically, when selecting the feature tag, a linear layer can be added at the output of the neural network model. When the character vectors and word sequences at the input are output by the neural network model, the linear layer at the output counts and screens each candidate tag by its probability of being the finally output feature tag, and finally selects the candidate tag with the highest probability to form a new semantic tag as the output feature; entity annotation is performed with the new output semantic tag to obtain the new feature tag. Alternatively, a softmax classifier can be attached to the output of the neural network model to predict the weight of each feature tag.
S206: selecting the candidate tag with the highest probability as the feature tag, with the candidate data set corresponding to the feature tag serving as the initial extraction data set.
When a neural network model is actually used to obtain the initial extraction data sets, the amount of data in the input data sets is huge. Text is an important information carrier, and owing to its diversity the same text has different meanings in different contexts; even the difference between a question mark and a full stop can affect the meaning of a whole sentence, and for Chinese in particular, different ways of punctuating can yield completely different information. Polysemy, colloquialisms, technical terms, mixtures of languages and sentence patterns all affect semantic parsing. Moreover, because of the variety of forms and the non-fixed expression of Chinese characters, the most accurate initial extraction data and corresponding feature tags cannot be obtained directly; if the judgment criteria in the neural network model were fixed once and for all, large amounts of data would easily be judged invalid, causing data loss. Therefore, the data sets are first input into the preset neural network model to obtain candidate tags and candidate data sets corresponding to the candidate tags; by counting the probability of each candidate tag being the finally output feature tag, the candidate tag with the highest probability can be chosen as the actually output feature tag, and the initial extraction data set is marked with it.
For example, in the resume extraction above, for the input data set 1, "Li Ming graduated from Tsinghua University in 2000", the character Li may be recognized as a surname through the surname rules the preset neural network model learned in training, and Li together with the following 1 or 2 characters may then be recognized as a name; but the recognition result of the first prediction alone is certainly not accurate, so this result is taken as the first candidate tag, and the second recognition result, i.e., the candidate tag marking "Li Ming" together with the second character Ming, is predicted; this process is then repeated until the last character. In the end, the weight of "Li Ming" for the candidate tag "name" is the largest, so the feature tag "name" is taken as the final output, and the feature tag is associated with "Li Ming" in data set 1.
This example lists only one layer of the feature-tag judgment process; in practical applications a linear layer is connected at the output of the neural network model, and this linear layer has a multi-layer judgment process, ensuring the adequacy of the data judgment.
Selecting the most probable feature tag by this statistical method and matching the initial extraction data with the feature tags corresponding to the initial extraction data can reduce the error rate of the neural network model when analyzing data in multiple formats.
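The probability-driven selection of S204-S206 can be sketched with a softmax over candidate-tag scores. In the patent the scores would come from the linear layer or softmax classifier at the network's output; the hard-coded scores here are a mock, not model output.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_feature_tag(candidate_tags, scores):
    """Select the candidate tag with the highest probability (S206)."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidate_tags[best], probs[best]

tags = ["name", "graduate school", "employer"]
tag, prob = pick_feature_tag(tags, [2.1, 0.3, -1.0])
print(tag)  # → name
```

The candidate data set associated with the winning tag would then serve as the initial extraction data set.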
Referring to Fig. 3, in one embodiment, the above data extraction method further includes a preprocessing step, which may be performed before the embodiment shown in Fig. 1 and may include the following steps:
S302: preprocessing sample data according to a preprocessing rule to obtain sample sets.
Here, the preprocessing rule is a manually set rule that processes the data to be preprocessed into data that can be used to train the initial neural network model, i.e., sample sets. Specifically, to ensure that the trained preset neural network model better fits the actual conditions of use, the preprocessing rule can be set to the same data splitting method as in the above data extraction method.
S304: Obtain the feature tag corresponding to each sample set.
Here, the feature tag corresponding to a sample set is the feature tag of the data in the sample set obtained in step S302, and the mapping between feature tags and sample-set data is obtained through data mining. Data mining generally refers to the process of searching, by algorithm, for information hidden in large amounts of data. It is usually related to computer science, and reaches this goal through many methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (relying on past rules of thumb) and pattern recognition. The powerful learning ability of a deep neural network is used to learn the correlated features between data, and the generated model is used to extract new data.
S306: Input the sample sets and feature tags into the initial neural network model to obtain the preset neural network model.
Here, the initial neural network model is the blank template established when a new neural network model is created. This template must be trained; after the training in this step, the initial neural network model that has learned becomes the preset neural network model. In this application, after the processed sample sets and their corresponding feature tags are input into the initial neural network model, the learning ability of the neural network yields a preset neural network model able to identify the feature tag matching the input data.
The data extraction method above thus further includes a preprocessing flow for training the preset neural network model, through which the preset neural network model is trained for actual data extraction. In this application, after the processed sample sets and the feature tags corresponding to the sample sets are input into the initial neural network model, the learning ability of the neural network yields a preset neural network model able to identify the feature tag matching the input data.
In one embodiment, the step in S302 above of preprocessing the sample data according to the preprocessing rule to obtain the sample sets further includes the following steps; refer to Fig. 4:
S402: Segment the samples according to a preset segmentation logic to obtain a character set.
Specifically, after data extraction yields the pending data, let the pending data text be D={D_1 ... D_n}, where D_n denotes the nth data text. The resume data text D={D_1 ... D_n} is then processed and split.
Using a trained segmentation model, the words inside each text segment and sentence are treated as individual characters one by one, giving the character set w={wd_1 ... wd_n}, where wd_n denotes the nth character.
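The character-level splitting of step S402 can be sketched as follows; a production system would use a trained segmentation model, so the plain character split here is a simplifying assumption:

```python
def segment_to_characters(texts):
    """Treat the words in each data text D_n as individual characters one by
    one, producing the character set w = [wd_1 ... wd_n]."""
    chars = []
    for text in texts:
        chars.extend(list(text))
    return chars

w = segment_to_characters(["李明", "清华大学"])  # -> ['李', '明', '清', '华', '大', '学']
```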
S404: Represent each character in the character set as a character vector by means of a preset vector model and the number of characters in the character set.
Here, the preset vector model is a pre-trained vector model based on characters or words; the Skip-gram model in Google's Word2Vec or Stanford's GloVe may be used. The characters of the pending data are represented as vectors of dimension d=N, used to initialize the neural network's character-vector table, and are then fine-tuned within the neural network system. A character vector is a character-based feature representation vector; that is, each character is represented by a vector of fixed dimension, where representing each character with d=100 dimensions is a parameter taken from engineering experience.
For example, for the character set w={wd_1 ... wd_n} split out in step S402 above, the pre-trained character- or word-based vector model is read, and each character in w is given a vector representation, forming character vectors v={v_1 ... v_n} of dimension d=N (e.g., 100 dimensions).
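A minimal sketch of the character-vector lookup of step S404 follows; the random vector table stands in for a real Word2Vec or GloVe model, and d=100 matches the engineering-experience dimension mentioned above:

```python
import numpy as np

d = 100  # character-vector dimension, an engineering-experience parameter
rng = np.random.default_rng(0)
# hypothetical pre-trained character-vector table; in practice it would come
# from a Skip-gram (Word2Vec) or GloVe model and then be fine-tuned
char_table = {ch: rng.normal(size=d) for ch in "李明2000年毕业于清华大学"}

def embed(chars, table, dim):
    """Map the character set w = [wd_1 ... wd_n] to character vectors
    v = [v_1 ... v_n]; unseen characters fall back to a zero vector."""
    return np.stack([table.get(c, np.zeros(dim)) for c in chars])

v = embed(["李", "明"], char_table, d)  # v has shape (2, 100)
```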
S406: Represent the characters in the character set as a character sequence according to a preset rule.
Here, the character sequence describes the contextual features of a language fragment and is the most basic observation sequence; precisely because of the inherent characteristics of Chinese characters and their fixed arrangement, each character in the sequence exhibits a certain role characteristic. For example, if a text segment contains M character sets and N denotes the total number of characters in each character set, the text segment can be represented by a sequence of M N-dimensional (0,1) vectors, with each character represented by one N-dimensional (0,1) vector. The character sequence corresponding to "电光防爆科技股份有限公司" (an explosion-proof technology company name) is {电, 光, 防, 爆, 科, 技, 股, 份, 有, 限, 公, 司}.
Specifically, for the character set w={wd_1 ... wd_n} split out in step S402 above, each character in w is represented as a character sequence according to the preset rule, e.g. B={B_1, B_2, B_3 ... B_n | n>0}, where B_n is a Chinese character or symbol string.
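The N-dimensional (0,1) representation described above is a one-hot encoding; the sketch below builds it over the company-name vocabulary, purely for illustration:

```python
def one_hot_sequence(chars, vocab):
    """Represent each character as an N-dimensional (0,1) vector, where N is
    the number of characters in the vocabulary; a text segment then becomes
    a sequence of such vectors."""
    index = {c: i for i, c in enumerate(vocab)}
    n = len(vocab)
    seq = []
    for c in chars:
        vec = [0] * n
        vec[index[c]] = 1
        seq.append(vec)
    return seq

vocab = list("电光防爆科技股份有限公司")  # 12 distinct characters, so N = 12
seq = one_hot_sequence(["电", "光"], vocab)
# each vector in seq contains exactly one 1
```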
S408: Obtain the sample sets from the character vectors and character sequences.
Specifically, the training set (train set) of character vectors and the training set of character sequences, into which the data to be extracted is divided, serve together as the input features of the preset neural network model. After processing such as format unification, these character-vector and character-sequence training sets become the sample sets for training the neural network model; the powerful self-learning ability of the deep neural network then automatically learns the correlated features between the pending data, so that the preset neural network model is trained.
Here, when the character-vector training set and the character-sequence training set into which the data to be extracted is divided serve together as the input features of the preset neural network model, dropout=N may be set to prevent over-fitting, where N is an engineering-experience parameter. The sample sets used to train the neural network model are also inspected, and the format of the sample-set data is made consistent with the format of the actual pending data, which facilitates actual data extraction.
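The dropout=N setting mentioned above can be sketched as inverted dropout; the rate of 0.5 is an arbitrary illustrative value, not one prescribed by the document:

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and rescale the survivors, to prevent over-fitting."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((4, 5))
y = dropout(x, 0.5, rng)  # surviving entries are rescaled to 2.0
```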
In Chinese word formation, words are highly flexible, so the vocabulary is enormous in size while lexical features are rich and hard to learn, and treating keywords as word combinations makes the roles of vocabulary items extremely complex. For example, part of a keyword may be split into other non-keywords; that is, segmentation into words yields far more features than segmentation into characters, greatly increasing the complexity of machine learning. The quality of the representation directly affects the recognition of resume-information elements. In a Chinese environment, the text must first be segmented into words, yet word segmentation has always been a bottleneck in the industry, and the quality of the segmentation at the early stage directly affects the subsequent named-entity recognition, degrading performance. We therefore use character vectors and character sequences as the feature input, effectively avoiding the segmentation problem. Since character-level segmentation yields fewer features than word-level segmentation, the influence of word segmentation is also reduced, greatly lowering the complexity of machine learning.
Referring to Fig. 5, in one embodiment, the step S104 of the above method, in which the data set is input into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data, further includes the following steps:
S502: Divide the sample sets into a training set and a validation set.
Here, the training set is used to train the initial model; the model parameters are then adjusted so that the model's benchmark performance on the validation set is optimal.
Specifically, the preprocessed sample sets are split: N% of the sample sets form the training set and N% of the training set forms the validation set, where the value of N is taken from engineering experience.
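The split just described can be sketched as follows; the 90% / 20% defaults mirror the resume example given in the document and are engineering-experience values:

```python
import random

def split_samples(samples, train_pct=0.9, val_pct=0.2, seed=0):
    """Take train_pct of the samples as the training set, then hold out
    val_pct of that training set as the validation set."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_pct)
    pool = shuffled[:n_train]
    n_val = int(len(pool) * val_pct)
    return pool[n_val:], pool[:n_val]  # (training set, validation set)

train, val = split_samples(list(range(100)))  # 72 training, 18 validation
```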
S504: Input the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model.
Specifically, to train a neural network model, the initial neural network model must first be established, and the training data is then input into the neural network system under its preset rules; the training data here refers to the training set and the feature tags corresponding to the training set. Through the learning ability of the neural network, the trained neural network model is obtained; once given an input data set, the trained neural network model outputs, according to the input data set and its own preset rules, feature tags and the data sets corresponding to the feature tags.
S506: Input the validation set into the trained neural network model to obtain validation feature tags.
Specifically, the trained neural network model obtained in step S504 is not necessarily completely accurate. To reduce the error of the trained neural network model in the actual data-extraction flow, a certain number of samples are extracted from the training set as the validation set; the validation set is then input into the trained neural network model to check its accuracy. The validation feature tags corresponding to the validation set are obtained in this step.
S508: When a validation feature tag is inconsistent with the feature tag corresponding to the training set, correct the trained neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.
When a validation feature tag is inconsistent with the feature tag corresponding to the training set, the trained neural network model has produced an error in application. At this point, the error type of the validation feature tag is analyzed to find the cause of the mistake, and the trained neural network model is corrected; the corrected trained neural network model is the preset neural network model used in actual data extraction.
The above steps constitute the validation process when training the preset neural network model. Through one or even several rounds of validation, it is ensured that the preset neural network model is applicable to the actual data-extraction flow, guaranteeing the accuracy of data extraction.
Continuing the resume-extraction example above, the segmented resume sample data is divided into 90% training resume samples and 20% validation resume samples. The data in the training resume samples is associated with feature tags and input into the initial neural network model for learning, yielding a trained neural network model; the validation resume samples are then input into the trained neural network model, and the accuracy of the feature tags that the trained neural network model outputs for the validation sample data is checked. If the output is wrong, the trained network model is shown to be faulty, and the trained neural network model is corrected with the feature tags corresponding to the resume training samples; the corrected neural network model can correctly associate feature tags with resume sample data, and this model is the correct preset neural network model.
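The check-and-correct cycle above can be sketched as a validation pass that collects mismatches between the model's output tags and the expected tags; the stand-in model and tags are hypothetical:

```python
def validate(model_predict, val_samples, val_tags):
    """Run the trained model on the validation samples and return the cases
    where the predicted feature tag disagrees with the expected one, so the
    model can be corrected before actual data extraction."""
    errors = []
    for sample, expected in zip(val_samples, val_tags):
        predicted = model_predict(sample)
        if predicted != expected:
            errors.append((sample, predicted, expected))
    return errors

# toy stand-in for the trained network: tags digit-bearing samples as "date"
toy_model = lambda s: "date" if any(c.isdigit() for c in s) else "name"
errors = validate(toy_model, ["李明", "2000"], ["name", "date"])
# an empty error list means this round of validation passed
```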
In one of the embodiments, referring to Fig. 6, a structural diagram of a data extraction device in an embodiment is provided. The data extraction device 600 includes:

A splitting module 602, configured to split pending data to obtain a data set.

A labeling module 604, configured to input the data set into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data.

An extraction module 606, configured to extract, according to a preset rule template, corresponding target extraction data from the initial extraction data.

An output module 608, configured to associate and output the target extraction data and the feature tags corresponding to the target extraction data.
In one of the embodiments, the splitting module 602 in the above data extraction device may also be configured to split the pending data according to punctuation marks to obtain the data set.
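A minimal sketch of punctuation-based splitting follows; the particular punctuation classes in the regular expression are an assumption for illustration:

```python
import re

def split_by_punctuation(text):
    """Split the pending data into a data set at Chinese and Western
    punctuation marks, dropping empty fragments."""
    parts = re.split(r"[，。；！？、,.;!?]+", text)
    return [p.strip() for p in parts if p.strip()]

data_set = split_by_punctuation("李明，2000年毕业于清华大学。")
# -> ["李明", "2000年毕业于清华大学"]
```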
In one of the embodiments, the labeling module 604 in the above data extraction device may include:

An alternative unit, configured to input the data set into the preset neural network model to obtain alternative data sets and alternative labels corresponding to the alternative data sets.

A statistics unit, configured to obtain the probability of each alternative label corresponding to an alternative data set.

A selection unit, configured to choose the alternative label with the greatest probability as the feature tag, and the alternative data set corresponding to the feature tag as the initial extraction data set.
In one of the embodiments, the above data extraction device may also include:

A reception module, configured to receive an adjustment instruction for the preset rule template after the feature tags and target extraction data have been associated and output, when the associated output feature tags and target extraction data contain errors.

An adjustment module, configured to adjust the preset rule template according to the adjustment instruction.
In one of the embodiments, the above data extraction device may also include:

An acquisition module, configured to preprocess sample data according to a preprocessing rule to obtain sample sets.

A tag definition module, configured to obtain the feature tag corresponding to each sample set.

A forming module, configured to input the sample sets and feature tags into the initial neural network model to obtain the preset neural network model.
In one of the embodiments, the labeling module 604 in the above data extraction device may include:

A sample division unit, configured to divide the sample sets into a training set and a validation set.

A training unit, configured to input the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model.

A validation tag unit, configured to input the validation set into the trained neural network model to obtain validation feature tags.

A verification unit, configured to, when a validation feature tag is inconsistent with the feature tag corresponding to the training set, correct the initial neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.
In one of the embodiments, the above acquisition module may include:

A segmentation subunit, configured to segment the samples according to the preset segmentation logic to obtain a character set.

A vectorization subunit, configured to represent each character in the character set as a character vector by means of the preset vector model and the number of characters in the character set.

A sequence subunit, configured to represent the characters in the character set as a character sequence according to the preset rule.

A collection subunit, configured to obtain the sample sets from the character vectors and character sequences.
For the specific limitations of the data extraction device, reference may be made to the limitations of the data extraction method above, which are not repeated here.
In one of the embodiments, referring to Fig. 7, a structural diagram of a computer device performing data extraction in an embodiment is provided. The computer device, which can perform data extraction, may be a general-purpose server or any other suitable computer device, and includes a memory, a processor, an operating system, a database and a data extraction program stored in the memory and runnable on the processor, where the memory may include internal storage. When the processor executes the data extraction program, the following steps are implemented: splitting pending data to obtain a data set; inputting the data set into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data; extracting target extraction data corresponding to a preset rule template from the initial extraction data; and associating and outputting the target extraction data and the feature tags corresponding to the target extraction data.
In one of the embodiments, the step implemented when the processor executes the program of splitting the pending data to obtain the data set may include: splitting the pending data according to punctuation marks to obtain the data set.

In one of the embodiments, the step implemented when the processor executes the program of inputting the data set into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data may include: inputting the data set into the preset neural network model to obtain alternative data sets and alternative labels corresponding to the alternative data sets; obtaining the probability of each alternative label corresponding to an alternative data set; and choosing the alternative label with the greatest probability as the feature tag, with the alternative data set corresponding to the feature tag as the initial extraction data set.

In one of the embodiments, after the step implemented when the processor executes the program of associating and outputting the feature tags and the target extraction data, the following is further included: when the associated output feature tags and target extraction data contain errors, receiving an adjustment instruction for the preset rule template; and adjusting the preset rule template according to the adjustment instruction.

In one of the embodiments, the following steps are also implemented when the program is executed by the processor: preprocessing sample data according to a preprocessing rule to obtain sample sets; obtaining the feature tag corresponding to each sample set; and inputting the sample sets and feature tags into the initial neural network model to obtain the preset neural network model.

In one of the embodiments, the step implemented when the processor executes the program of inputting the sample sets and feature tags into the initial neural network model to obtain the preset neural network model includes: dividing the sample sets into a training set and a validation set; inputting the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model; inputting the validation set into the trained neural network model to obtain validation feature tags; and, when a validation feature tag is inconsistent with the feature tag corresponding to the training set, correcting the initial neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.

In one of the embodiments, the step implemented when the processor executes the program of preprocessing the sample data according to the preprocessing rule to obtain the sample sets includes: segmenting the samples according to a preset segmentation logic to obtain a character set; representing each character in the character set as a character vector by means of a preset vector model and the number of characters in the character set; representing the characters in the character set as a character sequence according to a preset rule; and obtaining the sample sets from the character vectors and the character sequences.
For the specific limitations of the computer device, reference may be made to the limitations of the data extraction method above, which are not repeated here.
In one embodiment, still referring to Fig. 7, a computer storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the following steps: splitting pending data to obtain a data set; inputting the data set into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data; extracting target extraction data corresponding to a preset rule template from the initial extraction data; and associating and outputting the target extraction data and the feature tags corresponding to the target extraction data.
In one of the embodiments, the step implemented when the program is executed by the processor of splitting the pending data to obtain the data set may include: splitting the pending data according to punctuation marks to obtain the data set.

In one of the embodiments, the step implemented when the program is executed by the processor of inputting the data set into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data includes: inputting the data set into the preset neural network model to obtain alternative data sets and alternative labels corresponding to the alternative data sets; obtaining the probability of each alternative label corresponding to an alternative data set; and choosing the alternative label with the greatest probability as the feature tag, with the alternative data set corresponding to the feature tag as the initial extraction data set.

In one of the embodiments, after the step implemented when the program is executed by the processor of associating and outputting the feature tags and the target extraction data, the following is further included: when the associated output feature tags and target extraction data contain errors, receiving an adjustment instruction for the preset rule template; and adjusting the preset rule template according to the adjustment instruction.

In one of the embodiments, the following steps are also implemented when the program is executed by the processor: preprocessing sample data according to a preprocessing rule to obtain sample sets; obtaining the feature tag corresponding to each sample set; and inputting the sample sets and feature tags into the initial neural network model to obtain the preset neural network model.

In one of the embodiments, the step implemented when the program is executed by the processor of inputting the sample sets and feature tags into the initial neural network model to obtain the preset neural network model includes: dividing the sample sets into a training set and a validation set; inputting the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model; inputting the validation set into the trained neural network model to obtain validation feature tags; and, when a validation feature tag is inconsistent with the feature tag corresponding to the training set, correcting the initial neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.

In one of the embodiments, the step implemented when the program is executed by the processor of preprocessing the sample data according to the preprocessing rule to obtain the sample sets includes: segmenting the samples according to a preset segmentation logic to obtain a character set; representing each character in the character set as a character vector by means of a preset vector model and the number of characters in the character set; representing the characters in the character set as a character sequence according to a preset rule; and obtaining the sample sets from the character vectors and the character sequences.
For the specific limitations of the computer storage medium, reference may be made to the limitations of the data extraction method above, and details are not described here again.
Those of ordinary skill in the art will appreciate that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware; the program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the flows of the embodiments of the above methods. The computer-readable storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), etc.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it shall be considered within the scope recorded in this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (10)

  1. A data extraction method, characterized by comprising:
    splitting pending data to obtain a data set;
    inputting the data set into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data;
    extracting target extraction data corresponding to a preset rule template from the initial extraction data;
    associating and outputting the target extraction data and the feature tags corresponding to the target extraction data.
  2. The method according to claim 1, characterized in that the step of splitting the pending data to obtain the data set comprises:
    splitting the pending data according to punctuation marks to obtain the data set.
  3. The method according to claim 1, characterized in that the step of inputting the data set into the preset neural network model to obtain the initial extraction data and the feature tags corresponding to the initial extraction data comprises:
    inputting the data set into the preset neural network model to obtain alternative data sets and alternative labels corresponding to the alternative data sets;
    obtaining the probability of each alternative label corresponding to the alternative data sets;
    choosing the alternative label with the greatest probability as the feature tag, with the alternative data set corresponding to the feature tag as the initial extraction data set.
  4. The method according to claim 1, characterized in that, after the step of associating and outputting the feature tags and the target extraction data, the method further comprises:
    when the associated output feature tags and the target extraction data contain errors, receiving an adjustment instruction for the preset rule template;
    adjusting the preset rule template according to the adjustment instruction.
  5. The method according to any one of claims 1 to 4, characterized in that the method further comprises:
    preprocessing sample data according to a preprocessing rule to obtain sample sets;
    obtaining the feature tag corresponding to each sample set;
    inputting the sample sets and the feature tags into an initial neural network model to obtain the preset neural network model.
  6. The method according to claim 5, characterized in that the step of inputting the sample sets and the feature tags into the initial neural network model to obtain the preset neural network model comprises:
    dividing the sample sets into a training set and a validation set;
    inputting the training set and the feature tags corresponding to the training set into the initial neural network model to obtain a trained neural network model;
    inputting the validation set into the trained neural network model to obtain validation feature tags;
    when a validation feature tag is inconsistent with the feature tag corresponding to the training set, correcting the trained neural network model with the feature tag corresponding to the training set to obtain the preset neural network model.
  7. The method according to claim 5, characterized in that the step of preprocessing the sample data according to the preprocessing rule to obtain the sample sets comprises:
    segmenting the samples according to a preset segmentation logic to obtain a character set;
    representing each character in the character set as a character vector by means of a preset vector model and the number of characters in the character set;
    representing the characters in the character set as a character sequence according to a preset rule;
    obtaining the sample sets from the character vectors and the character sequences.
  8. A data extraction device, characterized by comprising:
    a splitting module, configured to split pending data to obtain a data set;
    a labeling module, configured to input the data set into a preset neural network model to obtain initial extraction data and feature tags corresponding to the initial extraction data;
    an extraction module, configured to extract, according to a preset rule template, corresponding target extraction data from the initial extraction data;
    an output module, configured to associate and output the target extraction data and the feature tags corresponding to the target extraction data.
  9. A computer device, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the method of any one of claims 1 to 7.
  10. A readable storage medium on which a computer program is stored, characterized in that, when the program is executed by a processor, the steps of the method of any one of claims 1 to 7 are implemented.
CN201711155534.1A 2017-11-20 2017-11-20 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing Pending CN107943911A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711155534.1A CN107943911A (en) 2017-11-20 2017-11-20 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711155534.1A CN107943911A (en) 2017-11-20 2017-11-20 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
CN107943911A true CN107943911A (en) 2018-04-20

Family

ID=61929143

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711155534.1A Pending CN107943911A (en) 2017-11-20 2017-11-20 Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing

Country Status (1)

Country Link
CN (1) CN107943911A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470699A (en) * 2007-12-28 2009-07-01 日电(中国)有限公司 Information extraction model training apparatus, information extraction apparatus and information extraction system and method thereof
CN105426360A (en) * 2015-11-12 2016-03-23 中国建设银行股份有限公司 Keyword extracting method and device
US20170144378A1 (en) * 2015-11-25 2017-05-25 Lawrence Livermore National Security, Llc Rapid closed-loop control based on machine learning
CN106777336A (en) * 2017-01-13 2017-05-31 深圳爱拼信息科技有限公司 A kind of exabyte composition extraction system and method based on deep learning
CN106874256A (en) * 2015-12-11 2017-06-20 北京国双科技有限公司 Name the method and device of entity in identification field
CN107193843A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of character string selection method and device based on AC automatic machines and postfix expression

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108733837A (en) * 2018-05-28 2018-11-02 杭州依图医疗技术有限公司 A kind of the natural language structural method and device of case history text
WO2019227584A1 (en) * 2018-05-31 2019-12-05 平安科技(深圳)有限公司 Method for parsing and processing resume data information, device, apparatus, and storage medium
CN108874942A (en) * 2018-06-04 2018-11-23 科大讯飞股份有限公司 A kind of information determines method, apparatus, equipment and readable storage medium storing program for executing
CN108829683B (en) * 2018-06-29 2022-06-10 北京百度网讯科技有限公司 Hybrid label learning neural network model and training method and device thereof
CN108829683A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Mixing mark learning neural network model and its training method, device
CN109145125A (en) * 2018-08-20 2019-01-04 长城计算机软件与***有限公司 A kind of method and system, the storage medium of dynamic Extracting Information
CN109165279A (en) * 2018-09-06 2019-01-08 深圳和而泰数据资源与云技术有限公司 information extraction method and device
CN109308304A (en) * 2018-09-18 2019-02-05 深圳和而泰数据资源与云技术有限公司 Information extraction method and device
CN109255128B (en) * 2018-10-11 2023-11-28 北京小米移动软件有限公司 Multi-level label generation method, device and storage medium
CN109255128A (en) * 2018-10-11 2019-01-22 北京小米移动软件有限公司 Generation method, device and the storage medium of multi-layer label
CN109657115B (en) * 2018-10-18 2023-04-14 平安科技(深圳)有限公司 Crawling data self-repairing method, device, equipment and medium
CN109657115A (en) * 2018-10-18 2019-04-19 平安科技(深圳)有限公司 Crawl data self-repair method, device, equipment and medium
CN109583594A (en) * 2018-11-16 2019-04-05 东软集团股份有限公司 Deep learning training method, device, equipment and readable storage medium storing program for executing
CN109616215B (en) * 2018-11-23 2021-07-09 金色熊猫有限公司 Medical data extraction method, device, storage medium and electronic equipment
CN109616215A (en) * 2018-11-23 2019-04-12 金色熊猫有限公司 Medical data abstracting method, device, storage medium and electronic equipment
CN111221975A (en) * 2018-11-26 2020-06-02 珠海格力电器股份有限公司 Method and device for extracting field and computer storage medium
CN111221975B (en) * 2018-11-26 2021-12-14 珠海格力电器股份有限公司 Method and device for extracting field and computer storage medium
CN109635288B (en) * 2018-11-29 2023-05-23 东莞理工学院 Resume extraction method based on deep neural network
CN109635288A (en) * 2018-11-29 2019-04-16 东莞理工学院 A kind of resume abstracting method based on deep neural network
CN110213239B (en) * 2019-05-08 2021-06-01 创新先进技术有限公司 Suspicious transaction message generation method and device and server
CN110213239A (en) * 2019-05-08 2019-09-06 阿里巴巴集团控股有限公司 Suspicious transaction message generation method, device and server
WO2020252919A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Resume identification method and apparatus, and computer device and storage medium
CN110717039A (en) * 2019-09-17 2020-01-21 平安科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium
CN110717039B (en) * 2019-09-17 2023-10-13 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer-readable storage medium
CN110795468A (en) * 2019-10-10 2020-02-14 中国建设银行股份有限公司 Data extraction method and device
CN110866393A (en) * 2019-11-19 2020-03-06 北京网聘咨询有限公司 Resume information extraction method and system based on domain knowledge base
CN111078737B (en) * 2019-11-25 2023-03-21 北京明略软件***有限公司 Commonality analysis method and device, data processing equipment and readable storage medium
CN111078737A (en) * 2019-11-25 2020-04-28 北京明略软件***有限公司 Commonality analysis method and device, data processing equipment and readable storage medium
CN111309572B (en) * 2020-02-13 2021-05-04 上海复深蓝软件股份有限公司 Test analysis method and device, computer equipment and storage medium
CN111309572A (en) * 2020-02-13 2020-06-19 上海复深蓝软件股份有限公司 Test analysis method and device, computer equipment and storage medium
CN111428484A (en) * 2020-04-14 2020-07-17 广州云从鼎望科技有限公司 Information management method, system, device and medium
CN111753546A (en) * 2020-06-23 2020-10-09 深圳市华云中盛科技股份有限公司 Document information extraction method and device, computer equipment and storage medium
CN111753546B (en) * 2020-06-23 2024-03-26 深圳市华云中盛科技股份有限公司 Method, device, computer equipment and storage medium for extracting document information
TWI820845B (en) * 2022-08-03 2023-11-01 中國信託商業銀行股份有限公司 Training data labeling method and its computing device, article labeling model establishment method and its computing device, and article labeling method and its computing device

Similar Documents

Publication Publication Date Title
CN107943911A (en) Data extraction method, apparatus, computer device and readable storage medium
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN109635288A (en) Resume extraction method based on deep neural network
CN112149421A (en) Software programming field entity identification method based on BERT embedding
CN109391706A (en) Domain name detection method, device, equipment and storage medium based on deep learning
CN103678684A (en) Chinese word segmentation method based on navigation information retrieval
CN107301163B (en) Formula-containing text semantic parsing method and device
CN110555206A (en) Named entity identification method, device, equipment and storage medium
CN105068990B (en) Multi-strategy English long-sentence segmentation method for machine translation
CN110008309A (en) Short phrase extraction method and device
CN110232123A (en) Text sentiment analysis method and device, computing equipment and readable medium
CN111124487A (en) Code clone detection method and device and electronic equipment
CN111143531A (en) Question-answer pair construction method, system, device and computer readable storage medium
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN115713085A (en) Document theme content analysis method and device
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN111814476B (en) Entity relation extraction method and device
CN111382243A (en) Text category matching method, text category matching device and terminal
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN108829898B (en) HTML content page release time extraction method and system
CN110866394A (en) Company name identification method and device, computer equipment and readable storage medium
CN116578700A (en) Log classification method, log classification device, equipment and medium
Vu-Manh et al. Improving Vietnamese dependency parsing using distributed word representations
CN115757815A (en) Knowledge graph construction method and device and storage medium
CN115392255A (en) Few-sample machine reading understanding method for bridge detection text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2018-04-20