CN109165279A - information extraction method and device - Google Patents
- Publication number
- CN109165279A (application CN201811048118.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- training corpus
- word
- label
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
Landscapes
- Business, Economics & Management (AREA)
- Strategic Management (AREA)
- Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- Development Economics (AREA)
- Finance (AREA)
- Entrepreneurship & Innovation (AREA)
- Game Theory and Decision Science (AREA)
- Data Mining & Analysis (AREA)
- Economics (AREA)
- Marketing (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Machine Translation (AREA)
Abstract
This application discloses an information extraction method and a corresponding device. The method comprises: obtaining first data, where the first data are data from which a target efficacy is to be extracted; inputting the first data into a bidirectional long short-term memory (LSTM) network and outputting second data, where the second data are used to indicate labels of the first data; and extracting target labels from the second data and obtaining the target efficacy according to the target labels, where there is a correspondence between the target labels and the target efficacy. A corresponding device is also provided. Using this application, the efficiency of information extraction can be effectively improved.
Description
Technical field
This application relates to the field of computer technology, and in particular to an information extraction method and device.
Background art
At present there are many beauty products on the market, and different websites describe the same cosmetics in different ways. To mine the effects of different cosmetics, entity extraction is generally performed using rules or regular expressions; that is, rules or regular expressions are used to extract descriptions of a cosmetic's effects from a large corpus of cosmetic descriptions.
However, the above method makes extraction inefficient.
Summary of the invention
This application provides an information extraction method and device, which can effectively improve the efficiency of information extraction.
In a first aspect, an embodiment of the present application provides an information extraction method, comprising:
obtaining first data, the first data being data from which a target efficacy is to be extracted;
inputting the first data into a bidirectional long short-term memory (LSTM) network and outputting second data, where the second data are used to indicate labels of the first data;
extracting target labels from the second data and obtaining the target efficacy according to the target labels, where there is a correspondence between the target labels and the target efficacy.
In the embodiments of the present application, a bidirectional long short-term memory (LSTM) network is used to obtain the labels of the data from which the target efficacy is to be extracted, so that the target labels corresponding to the target efficacy are extracted from the resulting second data, and the target efficacy is then obtained. Implementing the embodiments of the present application is not only simple to realize but can also improve the efficiency of extracting the target efficacy.
In one possible implementation, the length of the first data is the same as the length of the second data.
In the embodiments of the present application, ensuring that the first data and the second data have the same length effectively guarantees that the information extraction device obtains a label for every word in the first data, and effectively avoids the bidirectional LSTM network leaving part of the first data unlabeled, which also improves the efficiency of extracting the target efficacy.
In one possible implementation, the first data include a natural language description of a product, and the target efficacy includes an effect of the product.
In the embodiments of the present application, the product may include a cosmetic product; that is, by processing the description of a cosmetic product, the effects of that cosmetic product can be obtained. Implementing the embodiments of the present application can improve the extraction accuracy for the cosmetics domain in a corpus and avoids manually enumerating exhaustive regular expressions.
In one possible implementation, before the first data are input into the bidirectional long short-term memory LSTM network, the method further includes:
obtaining a training corpus;
labeling each word in the training corpus to obtain the label of each word in the training corpus;
encoding each word in the training corpus to obtain the one-dimensional vector of each word in the training corpus;
inputting the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, where there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
In the embodiments of the present application, the bidirectional LSTM network is trained with the training corpus so as to learn the weight parameters of the bidirectional LSTM, and the trained network is then used to realize the information extraction method provided by the embodiments of the present application, improving the feasibility of the method.
In one possible implementation, obtaining the training corpus comprises:
obtaining the training corpus using a web crawler.
In the embodiments of the present application, collecting the training corpus with a web crawler ensures that a large and sufficient training corpus can be collected, improving the training effect of the bidirectional LSTM network.
In a second aspect, an embodiment of the present application provides an information extraction device, comprising:
an obtaining unit, configured to obtain first data, the first data being data from which a target efficacy is to be extracted;
an input-output unit, configured to input the first data into a bidirectional long short-term memory (LSTM) network and output second data, where the second data are used to indicate labels of the first data;
an extracting unit, configured to extract target labels from the second data and obtain the target efficacy according to the target labels, where there is a correspondence between the target labels and the target efficacy.
In one possible implementation, the length of the first data is the same as the length of the second data.
In one possible implementation, the first data include a natural language description of a product, and the target efficacy includes an effect of the product.
In one possible implementation, the obtaining unit is further configured to obtain a training corpus;
the device further includes:
a labeling unit, configured to label each word in the training corpus to obtain the label of each word in the training corpus;
an encoding unit, configured to encode each word in the training corpus to obtain the one-dimensional vector of each word in the training corpus;
a training unit, configured to input the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, where there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
In one possible implementation, the obtaining unit is specifically configured to obtain the training corpus using a web crawler.
In a third aspect, an embodiment of the present application further provides an information extraction device, comprising: a processor, a memory and an input/output interface, the processor, the memory and the input/output interface being interconnected via a line, where the memory stores program instructions; when the program instructions are executed by the processor, the processor performs the corresponding method described in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by the processor of an information extraction device, cause the processor to perform the method described in the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the method described in the first aspect.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application or in the background art more clearly, the drawings needed in the embodiments or the background art are described below.
Fig. 1 is a schematic flowchart of an information extraction method provided by an embodiment of the present application;
Fig. 2 is a schematic flowchart of a training method for a bidirectional LSTM network provided by an embodiment of the present application;
Fig. 3 is a schematic scenario diagram of a bidirectional LSTM network provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an information extraction device provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of another information extraction device provided by an embodiment of the present application;
Fig. 6 is a schematic structural diagram of another information extraction device provided by an embodiment of the present application.
Detailed description of embodiments
To make the purposes, technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the drawings.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of an information extraction method provided by an embodiment of the present application. The information extraction method can be applied to an information extraction device, which may be a terminal device such as a mobile phone, a tablet or a computer, or may be a server, etc.; the embodiments of the present application do not limit which kind of device the information extraction device is. Optionally, the information extraction device may also be a stand-alone chip, etc., which is likewise not limited. As shown in Fig. 1, the information extraction method may include:
101. Obtain first data, the first data being data from which a target efficacy is to be extracted.
In the embodiments of the present application, the first data may be any natural language description. Specifically, the first data may include a natural language description of a product, e.g., a natural language description of a cosmetic. The target efficacy includes an effect of the product; that is, the data containing the natural language description of a cosmetic may mention one or more of the cosmetic's effects, such as whitening, moisturizing, oil control, anti-aging and spot removal.
It is understood that the embodiments do not limit where the information extraction device obtains the first data from. For example, the information extraction device may photograph the description on a cosmetics bottle with a camera and obtain the first data after recognizing it; or the information extraction device may obtain the first data directly from a web page; or the information extraction device may obtain the first data from another device used to obtain such data. The embodiments of the present application place no limit on this.
102. Input the first data into the bidirectional long short-term memory (LSTM) network and output second data, where the second data are used to indicate the labels of the first data.
In the embodiments of the present application, the bidirectional LSTM network may be a trained network that the information extraction device obtains from another device, or a network trained by the information extraction device itself; the embodiments place no limit on this.
In the embodiments of the present application, an LSTM network is a kind of recurrent neural network over time that can be used to process and predict events with relatively long intervals and delays in a time series. For example, an LSTM network can learn document summarization, speech recognition, handwriting recognition, and so on. Specifically, the LSTM network can automatically judge whether information added to the network is useful: a unit (cell) is provided for judging whether incoming information is useful, so that when information enters the LSTM network it can be judged by rule; information that meets the network's criteria is kept, while information that does not is forgotten. Specifically, a cell may include three gates, i.e., an input gate, a forget gate and an output gate. The input gate can be used to admit certain input information into memory, the forget gate can be used to selectively forget certain past information, and the output gate can combine the admitted input with what remains after forgetting and then output the result.
Thus, in the embodiments of the present application, when the first data are input into the LSTM network, the network can judge through its cells whether the first data contain information that meets its criteria; if so, the LSTM network keeps the useful information, e.g., the input gate may choose to memorize the input first data. After the LSTM merges the information learned during training with the first data, the LSTM network can analyze or learn from the useful information and then output the second data through the output gate. It is understood that the above is only one form of LSTM network provided by the embodiments of the present application; in a concrete realization, the network may have many variations or improvements, so the LSTM network illustrated above should not be interpreted as a restriction on the embodiments.
It is understood that, as for how to train or learn the LSTM network, or how to make the LSTM network output useful information corresponding to the input, reference may be made to the training method shown in Fig. 2, which is not detailed again here.
In the embodiments of the present application, the second data can be used to indicate the labels of the first data, and the labels may take digital form. Optionally, the second data may also be used to indicate the effects of the first data. For example, if the first data are "salicylic acid has a fine whitening effect", whitening is indicated by the digit 1 and other data by 0, then inputting the first data into the bidirectional LSTM may output second data such as "0000110000"; that is, the "11" in the second data can indicate whitening. It is understood that the above is only an example and should not be construed as a restriction on the embodiments.
Optionally, to guarantee that the bidirectional LSTM network labels every word of the input first data, in the embodiments of the present application the length of the first data is the same as the length of the second data. That is, by guaranteeing that the two lengths match, every word in the first data can be labeled, yielding the label of each word of the first data.
103. Extract target labels from the second data and obtain the target efficacy according to the target labels, where there is a correspondence between the target labels and the target efficacy.
In the embodiments of the present application, the target labels can correspond to the target efficacy. Continuing the example above, if the output second data are "0000110000", the target label can be "11"; since the digit 1 represents whitening, the target efficacy can be obtained from the target label.
For another example, whitening is indicated by the digit 1, oil control by 2, moisturizing by 3, anti-aging by 4, and the digit 0 indicates "other", i.e., data other than whitening, oil control, moisturizing and anti-aging. If the first data are "resist the invasion of the years, relieve dry skin, lightly smooth away fine lines and wrinkles", then after the first data are input into the bidirectional LSTM network, the output second data may be "000000003300440000"; extracting the target labels 33 and 44 from them shows that the first data describe moisturizing and anti-aging. That is, although moisturizing and anti-aging are not explicitly recorded in the first data, semantic analysis shows that the meaning expressed is moisturizing and anti-aging, and the bidirectional LSTM network can derive the corresponding labels (including the target labels) by analyzing the input first data, e.g., by learning sentence structure and word or character information. It is understood that the embodiments place no limit on the specific form in which the second data are represented.
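As a minimal sketch of step 103, the target efficacies can be read off a per-character tag sequence using the illustrative digit scheme above; the function name and the exact tag values are assumptions for illustration, not the patent's fixed scheme:

```python
# Illustrative tag scheme from the examples above (not the patent's fixed scheme):
# 0 = other, 1 = whitening, 2 = oil control, 3 = moisturizing, 4 = anti-aging
EFFECTS = {1: "whitening", 2: "oil control", 3: "moisturizing", 4: "anti-aging"}

def extract_effects(tags):
    """Collect the set of target efficacies indicated by a tag sequence
    (the 'second data') output by the bidirectional LSTM."""
    return {EFFECTS[t] for t in tags if t in EFFECTS}

# A tag sequence analogous to "...003300...4400..." in the text:
tags = [0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 0, 0, 4, 4, 0, 0, 0, 0]
effects = extract_effects(tags)  # {"moisturizing", "anti-aging"}
```

A run over "0000110000" would likewise yield only {"whitening"}, matching the first example.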
In the embodiments of the present application, the labels of the data from which the target efficacy is to be extracted are obtained using the bidirectional LSTM network, so that target labels corresponding to the target efficacy are extracted from the resulting second data, and the target efficacy is then obtained. Implementing the embodiments of the present application is not only simple to realize but can also improve the extraction efficiency of the target efficacy.
The following takes the case where the information extraction device trains the bidirectional LSTM network as an example to illustrate how the information extraction device trains the network. Referring to Fig. 2, Fig. 2 is a schematic flowchart of a method for training a bidirectional LSTM network provided by an embodiment of the present application. The method can be applied to an information extraction device, which may be a terminal device or a server, etc.; the embodiments place no limit on this. As shown in Fig. 2, the training method may include:
201. Obtain a training corpus.
In the embodiments of the present application, the training corpus may include natural language descriptions related to cosmetic products. Specifically, in the embodiments of the present application, obtaining the training corpus includes:
obtaining the training corpus using a web crawler.
A web crawler, also known as a web spider or web robot, is a program or script that automatically crawls content on the web according to certain rules. Therefore, by using a web crawler to obtain a large training corpus from the web, the embodiments of the present application can effectively improve the training effect of the bidirectional LSTM network.
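A hedged sketch of corpus collection using Python's standard library: the page markup and the idea of harvesting `<p>` text are assumptions for illustration, and a real crawler would also fetch pages (e.g., with `urllib`), follow links and respect robots.txt:

```python
from html.parser import HTMLParser

class DescriptionCollector(HTMLParser):
    """Minimal sketch: pull the text inside <p> tags of a product page
    to build up training corpus. The markup below is illustrative."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p and data.strip():
            self.paragraphs.append(data.strip())

# A stand-in for one fetched page:
page = "<html><body><p>Salicylic acid has a good whitening effect.</p></body></html>"
collector = DescriptionCollector()
collector.feed(page)
```

Each collected paragraph would then be labeled and encoded as described in steps 202 and 203.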
202. Label each word in the training corpus to obtain the label of each word in the training corpus.
In the embodiments of the present application, labeling each word in the training corpus may include labeling each word according to a preset rule of correspondence between labels and effects, so as to classify each word. The label of each word in the training corpus can be expressed in digital form, e.g., whitening indicated by the digit 1, oil control by 2, moisturizing by 3, anti-aging by 4, and "other" by 0; the embodiments of the present application do not limit the description of the effects to these.
It is understood that labeling each word in the training corpus to obtain the label of each word may involve the following two situations:
Situation 1: the training corpus contains natural language descriptions of effects; that is, certain effects such as whitening or oil control appear directly in the corpus. In this case, the information extraction device can label the training corpus directly.
Situation 2: the training corpus does not directly contain natural language descriptions of effects; that is, words such as whitening, oil control or anti-aging are not explicitly present. In this case, the information extraction device can label the training corpus after performing semantic analysis on it.
It is understood that the method by which the information extraction device labels each word in the training corpus may be, first, that the device receives a labeling instruction input by a user and labels each word in the training corpus according to that instruction, thereby obtaining the label of each word; or second, that the device labels automatically according to a preset rule, e.g., a rule set by the user before the device performs labeling. The preset rule can express the correspondence between effects and labels, e.g., the digit 1 indicates whitening, 2 indicates oil control, 3 indicates moisturizing, 4 indicates anti-aging, and 0 indicates other; the embodiments of the present application place no limit on the rule.
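The preset rule described above can be sketched as a simple keyword-to-digit table; the keyword spellings and the whitespace tokenization are illustrative assumptions, not the patent's actual rule:

```python
# Illustrative preset rule: correspondence between effect keywords and labels
RULES = {"whitening": 1, "oil-control": 2, "moisturizing": 3, "anti-aging": 4}

def label_tokens(tokens):
    """Label each token with its effect digit; 0 means 'other'."""
    return [RULES.get(tok, 0) for tok in tokens]

tokens = ["salicylic", "acid", "whitening", "effect", "is", "fine"]
labels = label_tokens(tokens)  # [0, 0, 1, 0, 0, 0]
```

Situation 2 above (effects only implied, never named) is exactly what this keyword rule cannot handle and what the trained bidirectional LSTM is meant to cover.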
203. Encode each word in the training corpus to obtain the one-dimensional vector of each word in the training corpus.
In the embodiments of the present application, after each word in the training corpus is encoded, every word in the training corpus can be converted into a vector of fixed length. It is understood that encoding each word in the training corpus may also include encoding each character in the training corpus.
For example, the fixed-length vector may be a vector of length 100. If the training corpus includes "salicylic acid has a fine whitening effect", encoding it can yield [X1, X2, X3, X4, X5, X6, X7, X8, X9, X10], where 1 to 10 indicate positions along the length of the sentence in the training corpus and each X indicates a one-dimensional vector.
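One common way to realize this encoding step is to map each word to an integer id and pad the sentence to a fixed length (an embedding layer inside the network then turns each id into a dense vector); the length of 10, the pad id of 0 and the vocabulary handling here are illustrative assumptions:

```python
def encode(tokens, vocab, max_len=10):
    """Assign each token an integer id (growing the vocabulary as
    needed) and pad the sentence to a fixed length; 0 is the pad id."""
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokens]
    return (ids + [0] * max_len)[:max_len]

vocab = {}
encoded = encode(["salicylic", "acid", "whitening"], vocab)
# encoded == [1, 2, 3, 0, 0, 0, 0, 0, 0, 0]
```

Because every sentence comes out at the same length, the label sequence y can be padded to match, preserving the one-to-one correspondence between vectors and labels that step 204 requires.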
204. Input the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, where there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
In the embodiments of the present application, the one-dimensional vectors of the words in the training corpus may be represented as [X1, X2, X3, X4, …, Xn], and the labels y of the words in the training corpus may be [0, 0, 0, 1, …, 0]. The one-dimensional vectors and y are input into the bidirectional LSTM network for learning; specifically, as shown in Fig. 3, the learning can be computed by the following formulas.
it=σ (Wix(t)+Uih(t-1)+bi) (1)
ft=σ (Wfx(t)+Ufh(t-1)+bf) (2)
ot=σ (Wox(t)+Uoh(t-1)+bo) (3)
ht=σt*tanh(Ct) (6)
Wherein, σ is sigmoid function, and W, U indicate weight, and h (t) indicates that the hidden state of t moment, b indicate biasing,
Middle C indicates unit memory, and i indicates that input vector, f indicate forgetting rate, and 0 indicates output vector.
Wherein i determines knots modification of the vector X for the memory state C in unit of input, f then have decided on whether using
Memory state before and to use how much.
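A minimal scalar sketch of one cell update implementing the gate equations above; the parameter values and dictionary keys are illustrative, and a real network uses vectors and weight matrices rather than scalars:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a (scalar) LSTM cell; p holds the weights and biases."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate memory
    c = f * c_prev + i * g       # keep part of the old memory, admit the new
    h = o * math.tanh(c)         # hidden state: filtered cell memory
    return h, c

# Illustrative parameters; training would adjust these values.
p = {k: 0.5 for k in ("wi", "ui", "bi", "wf", "uf", "bf",
                      "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.0, p=p)
```

Training amounts to adjusting the entries of p (the W, U and b of the formulas) so that the hidden states h, fed to a classifier, reproduce the labels y.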
The weight parameters in the bidirectional LSTM network, such as W, U and b, can be trained from the one-dimensional vectors Xn and the corresponding y. It is understood that training or learning the LSTM network means training the above parameters, namely the weight parameters in the forget gate, input gate and output gate of the cell; after the weight parameters are determined, the LSTM network can be used to apply the method shown in Fig. 1 provided by the embodiments of the present application.
By implementing the embodiments of the present application and collecting a large training corpus, the training of the bidirectional LSTM network can be completed effectively, which not only improves the training effect of the bidirectional LSTM network but also enables the information extraction device to improve the efficiency of information extraction.
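The "bidirectional" aspect of the network can be sketched as two recurrences, one left-to-right and one right-to-left, whose states are combined per position before classification; the toy forward/backward/classify functions below merely stand in for the trained sub-networks:

```python
def bilstm_tags(tokens, forward, backward, classify):
    """Run one recurrence over the tokens in each direction, then
    classify each position from both hidden states."""
    f_states, h = [], 0.0
    for t in tokens:                 # left-to-right pass
        h = forward(t, h)
        f_states.append(h)
    b_states, h = [], 0.0
    for t in reversed(tokens):       # right-to-left pass
        h = backward(t, h)
        b_states.append(h)
    b_states.reverse()               # align with token order
    return [classify(f, b) for f, b in zip(f_states, b_states)]

# Toy stand-ins: each "recurrence" just accumulates, the "classifier" sums.
tags = bilstm_tags([1, 2, 3],
                   forward=lambda t, h: h + t,
                   backward=lambda t, h: h + t,
                   classify=lambda f, b: f + b)  # [7, 8, 9]
```

Because every position sees both left and right context, a word like "wrinkles" can be tagged as anti-aging even when the decisive cue comes later in the sentence.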
The method of the embodiments of the present application has been described above; the device of the embodiments of the present application is provided below.
Referring to Fig. 4, Fig. 4 is a schematic structural diagram of an information extraction device provided by an embodiment of the present application. The information extraction device can be used to execute the methods shown in Fig. 1 and Fig. 2. As shown in Fig. 4, the information extraction device includes:
an obtaining unit 401, configured to obtain first data, the first data being data from which a target efficacy is to be extracted;
an input-output unit 402, configured to input the first data into the bidirectional long short-term memory (LSTM) network and output second data, where the second data are used to indicate the labels of the first data;
an extracting unit 403, configured to extract target labels from the second data and obtain the target efficacy according to the target labels, where there is a correspondence between the target labels and the target efficacy.
In the embodiments of the present application, the labels of the data from which the target efficacy is to be extracted are obtained using the bidirectional LSTM network, so that target labels corresponding to the target efficacy are extracted from the resulting second data, and the target efficacy is then obtained. Implementing the embodiments of the present application is not only simple to realize but can also improve the extraction efficiency of the target efficacy.
Specifically, the obtaining unit 401 is further configured to obtain a training corpus.
Optionally, as shown in Fig. 5, the device further includes:
a labeling unit 404, configured to label each word in the training corpus to obtain the label of each word in the training corpus;
an encoding unit 405, configured to encode each word in the training corpus to obtain the one-dimensional vector of each word in the training corpus;
a training unit 406, configured to input the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, where there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
Specifically, the obtaining unit 401 is specifically configured to obtain the training corpus using a web crawler.
It is understood that for the specific implementation of the information extraction devices shown in Fig. 4 and Fig. 5, reference may be made to the implementation of the methods shown in Fig. 1 and Fig. 2, which is not detailed again here.
Referring to Fig. 6, Fig. 6 is a schematic structural diagram of an information extraction device provided by an embodiment of the present application. The information extraction device includes a processor 601, a memory 602 and an input/output interface 603, which are interconnected by a bus.
The memory 602 includes, but is not limited to, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or compact disc read-only memory (CD-ROM), and is used for related instructions and data.
The input/output interface 603 can, for example, be used to communicate with other devices; e.g., it can be used to obtain a trained bidirectional LSTM network sent by another device, or to obtain the first data, etc.; the embodiments of the present application place no limit on this.
The processor 601 may be one or more central processing units (CPUs); in the case where the processor 601 is a single CPU, the CPU may be a single-core CPU or a multi-core CPU.
Specifically, for the realization of each operation, reference may also be made to the corresponding description of the method embodiments shown in Fig. 1 and Fig. 2, and to the corresponding description of the device embodiments shown in Fig. 4 and Fig. 5. For example, in one embodiment the processor 601 can be used to execute the methods shown in steps 101 to 103, and can also be used to execute the methods performed by the obtaining unit 401, the input-output unit 402, the extracting unit 403, etc.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be completed by a computer program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, may include the processes of each of the above method embodiments. The aforementioned storage medium includes media that can store program code, such as ROM, random access memory (RAM), magnetic disks or optical discs.
Claims (10)
1. An information extraction method, characterized by comprising:
obtaining first data, the first data being data from which a targeted efficacy is to be extracted;
inputting the first data into a bidirectional long short-term memory (LSTM) network and outputting second data, wherein the second data is used to indicate labels of the first data; and
extracting target labels from the second data, and obtaining the targeted efficacy according to the target labels, wherein there is a correspondence between the target labels and the targeted efficacy.
2. The method according to claim 1, wherein the length of the first data is the same as the length of the second data.
3. The method according to claim 1 or 2, wherein the first data comprises a natural-language description of a product, and the targeted efficacy comprises an efficacy of the product.
4. The method according to claim 3, wherein before inputting the first data into the bidirectional LSTM network, the method further comprises:
obtaining a training corpus;
labeling each word in the training corpus to obtain a label of each word in the training corpus;
encoding each word in the training corpus to obtain a one-dimensional vector of each word in the training corpus; and
inputting the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, wherein there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
5. The method according to claim 4, wherein obtaining the training corpus comprises:
obtaining the training corpus by means of a web crawler.
6. An information extraction apparatus, characterized by comprising:
an acquiring unit, configured to obtain first data, the first data being data from which a targeted efficacy is to be extracted;
an input-output unit, configured to input the first data into a bidirectional long short-term memory (LSTM) network and output second data, wherein the second data is used to indicate labels of the first data; and
an extracting unit, configured to extract target labels from the second data and obtain the targeted efficacy according to the target labels, wherein there is a correspondence between the target labels and the targeted efficacy.
7. The apparatus according to claim 6, wherein:
the acquiring unit is further configured to obtain a training corpus; and
the apparatus further comprises:
a labeling unit, configured to label each word in the training corpus to obtain a label of each word in the training corpus;
an encoding unit, configured to encode each word in the training corpus to obtain a one-dimensional vector of each word in the training corpus; and
a training unit, configured to input the label of each word in the training corpus and the one-dimensional vector of each word in the training corpus into the bidirectional LSTM network to train the bidirectional LSTM network, wherein there is a correspondence between the label of each word and the one-dimensional vector of each word in the training corpus.
8. The apparatus according to claim 7, wherein the acquiring unit is specifically configured to obtain the training corpus by means of a web crawler.
9. An information extraction apparatus, characterized by comprising a processor, a memory, and an input/output interface, the processor, the memory, and the input/output interface being interconnected by a line, wherein the memory stores program instructions, and when the program instructions are executed by the processor, the processor is caused to perform the method according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprising program instructions that, when executed by a processor of an information extraction apparatus, cause the processor to perform the method according to any one of claims 1 to 5.
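The training-data preparation of claims 4 and 7 — labeling each word and encoding it as a one-dimensional vector paired with its label — might be sketched as follows. The tag set and the one-hot encoding scheme are illustrative assumptions; the patent does not fix either.

```python
def build_training_pairs(corpus, tag_set=("O", "B-EFF", "I-EFF")):
    """Encode each word of the training corpus as a one-hot (one-dimensional)
    vector and pair it with its label id, preserving the word-vector/label
    correspondence required by claim 4. Tag names are hypothetical.

    corpus -- list of sentences, each a list of (word, label) pairs
    """
    # Build a vocabulary over the whole corpus and fixed id maps.
    vocab = sorted({w for sent in corpus for w, _ in sent})
    word_id = {w: i for i, w in enumerate(vocab)}
    tag_id = {t: i for i, t in enumerate(tag_set)}
    pairs = []
    for sent in corpus:
        vectors, labels = [], []
        for word, tag in sent:
            vec = [0] * len(vocab)       # one-dimensional vector for this word
            vec[word_id[word]] = 1
            vectors.append(vec)
            labels.append(tag_id[tag])   # label kept in step with its vector
        pairs.append((vectors, labels))
    return pairs, word_id, tag_id
```

Each `(vectors, labels)` pair would then be fed to the bidirectional LSTM network for training, as the claims describe; the per-word alignment is what lets the network emit one label per input position at inference time (cf. claim 2).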
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811048118.6A CN109165279A (en) | 2018-09-06 | 2018-09-06 | information extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109165279A true CN109165279A (en) | 2019-01-08 |
Family
ID=64894493
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811048118.6A Pending CN109165279A (en) | 2018-09-06 | 2018-09-06 | information extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109165279A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156003A (en) * | 2016-06-30 | 2016-11-23 | 北京大学 | A kind of question sentence understanding method in question answering system |
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN107563407A (en) * | 2017-08-01 | 2018-01-09 | 同济大学 | A kind of character representation learning system of the multi-modal big data in network-oriented space |
CN107797988A (en) * | 2017-10-12 | 2018-03-13 | 北京知道未来信息技术有限公司 | A kind of mixing language material name entity recognition method based on Bi LSTM |
CN107908614A (en) * | 2017-10-12 | 2018-04-13 | 北京知道未来信息技术有限公司 | A kind of name entity recognition method based on Bi LSTM |
CN107943911A (en) * | 2017-11-20 | 2018-04-20 | 北京大学深圳研究院 | Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190108 |