CN109145294A - Text entity recognition method and apparatus, electronic device, storage medium - Google Patents

Text entity recognition method and apparatus, electronic device, storage medium

Info

Publication number
CN109145294A
CN109145294A (application CN201810890287.8A)
Authority
CN
China
Prior art keywords
word
attribute vector
target
similar
target word
Prior art date
Legal status
Granted
Application number
CN201810890287.8A
Other languages
Chinese (zh)
Other versions
CN109145294B (en)
Inventor
陈红生
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201810890287.8A
Publication of CN109145294A
Application granted
Publication of CN109145294B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition


Abstract

The present disclosure provides a text entity recognition method and apparatus, belonging to the field of computer technology. The method comprises: segmenting a text to be recognized to obtain a target word sequence composed of multiple target words; matching each target word against a preset dictionary and obtaining the target word's attribute vector from the preset dictionary according to the match result; if a target word is an out-of-vocabulary word that fails to match the preset dictionary, obtaining multiple similar words for it from the preset dictionary and computing its attribute vector from the similar words' attribute vectors; and converting the target word sequence into an attribute vector sequence according to each target word's attribute vector, then processing the attribute vector sequence with a machine learning model to obtain an entity recognition result for each target word in the text to be recognized. The disclosure reduces the workload of manually labeling text data and improves the accuracy of entity recognition.

Description

Text entity recognition method and apparatus, electronic device, storage medium
Technical field
The present disclosure relates to the field of computer technology, and in particular to a text entity recognition method and apparatus, an electronic device, and a computer-readable storage medium.
Background technique
With the development of computer technology, artificial intelligence has received growing attention. Natural language processing, an important branch of artificial intelligence, is widely applied in search, intelligent customer service, machine translation, text proofreading, and other areas.
Natural language processing frequently requires entity recognition, i.e., identifying the specific type of a given word in a text so that the word's meaning, or the intent of the person who entered the text, can subsequently be determined.
Most existing text entity recognition methods require the types to be recognized to be defined in advance, and require text data to be manually labeled for each recognition domain and application scenario in order to train a suitable model, which then performs entity recognition on new text. However, model training usually requires a very large amount of text data, and manually labeling that data is very costly in labor. Moreover, because the type of text data and the labeling scheme differ across application domains, the manual labeling workload increases considerably.
It should be noted that the information disclosed in the background section above is provided only to aid understanding of the background of the present disclosure, and may therefore include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
The present disclosure provides a text entity recognition method and apparatus, an electronic device, and a computer-readable storage medium, thereby overcoming, at least to some extent, the dependence of existing text entity recognition methods on manually labeling large amounts of text data.
Other features and advantages of the disclosure will become apparent from the following detailed description, or will be learned in part through practice of the disclosure.
According to one aspect of the disclosure, a text entity recognition method is provided, comprising: segmenting a text to be recognized to obtain a target word sequence composed of multiple target words; matching each target word against a preset dictionary, and obtaining the target word's attribute vector from the preset dictionary according to the match result; if a target word is an out-of-vocabulary word that fails to match the preset dictionary, obtaining multiple similar words for the out-of-vocabulary word from the preset dictionary and computing its attribute vector from the similar words' attribute vectors; and converting the target word sequence into an attribute vector sequence according to each target word's attribute vector, then processing the attribute vector sequence with a machine learning model to obtain an entity recognition result for each target word in the text to be recognized.
In an exemplary embodiment of the disclosure, obtaining multiple similar words for the out-of-vocabulary word from the preset dictionary comprises: converting the out-of-vocabulary word into a target word vector; and searching the preset dictionary for words whose word vectors are close to the target word vector, as the similar words of the out-of-vocabulary word.
In an exemplary embodiment of the disclosure, searching the preset dictionary for words whose word vectors are close to the target word vector comprises: traversing the preset dictionary and finding all words whose word vectors have a cosine similarity with the target word vector greater than a similarity threshold.
In an exemplary embodiment of the disclosure, the method further comprises: adding the out-of-vocabulary word and its attribute vector to the preset dictionary.
In an exemplary embodiment of the disclosure, computing the attribute vector of the out-of-vocabulary word from the similar words' attribute vectors comprises: averaging the attribute vectors of the similar words to obtain the attribute vector of the out-of-vocabulary word.
In an exemplary embodiment of the disclosure, the machine learning model comprises a conditional random field model or a neural network model.
In an exemplary embodiment of the disclosure, the attribute vector comprises at least one of the following dimensions: part-of-speech information, word length, word frequency, and left/right entropy information.
According to one aspect of the disclosure, a text entity recognition apparatus is provided, comprising: a segmentation module for segmenting a text to be recognized to obtain a target word sequence composed of multiple target words; a feature acquisition module for matching each target word against a preset dictionary and obtaining the target word's attribute vector from the preset dictionary according to the match result; a similarity processing module for, when a target word is an out-of-vocabulary word that fails to match the preset dictionary, obtaining multiple similar words for it from the preset dictionary and computing its attribute vector from the similar words' attribute vectors; and a sequence processing module for converting the target word sequence into an attribute vector sequence according to each target word's attribute vector and processing the attribute vector sequence with a machine learning model to obtain an entity recognition result for each target word in the text to be recognized.
According to one aspect of the disclosure, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform any of the above methods by executing the executable instructions.
According to one aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements any of the above methods.
Exemplary embodiments of the disclosure have the following beneficial effects:
For each target word in the text to be recognized, its attribute vector is obtained either by matching against the preset dictionary or by computation from similar words; the text is then converted into an attribute vector sequence and processed by a machine learning model to obtain an entity recognition result for each target word. On the one hand, the attribute vector of an out-of-vocabulary word can be computed from already-labeled similar words, so out-of-vocabulary words need no dedicated labeling, which reduces the workload of manually labeling text data and saves labor cost. On the other hand, entity recognition is performed on a target word on the basis of its attribute vector, which can include information about the word along multiple dimensions, increasing the accuracy of entity recognition.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the disclosure.
Detailed description of the invention
The accompanying drawings are incorporated into and constitute a part of this specification; they show embodiments consistent with the disclosure and serve, together with the specification, to explain its principles. The drawings described below are obviously only some embodiments of the disclosure; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 schematically shows a flowchart of a text entity recognition method in the present exemplary embodiment;
Fig. 2 schematically shows a flowchart of another text entity recognition method in the present exemplary embodiment;
Fig. 3 schematically shows a flowchart of text entity recognition in an application scenario in the present exemplary embodiment;
Fig. 4 schematically shows a structural block diagram of a text entity recognition apparatus in the present exemplary embodiment;
Fig. 5 schematically shows an electronic device for implementing the above method in the present exemplary embodiment;
Fig. 6 schematically shows a computer-readable storage medium for implementing the above method in the present exemplary embodiment.
Specific embodiment
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments can, however, be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
An exemplary embodiment of the disclosure provides a text entity recognition method. As shown in Fig. 1, the method may include the following steps S11-S14:
Step S11: segment the text to be recognized to obtain a target word sequence composed of multiple target words.
The text to be recognized is usually a sentence composed of multiple words, each of which is a target word. Segmentation can be performed with a segmentation tool such as jieba, THULAC (a Chinese lexical analysis toolkit developed by Tsinghua University), or LTP (a Chinese natural language processing toolkit developed by Harbin Institute of Technology). After segmentation, each word's position in the text can be retained, converting the text to be recognized into a target word sequence. When processing with a segmentation tool, structural words or modal particles that carry no substantive meaning, such as '请' (please), can also be removed.
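As a rough illustration of what such segmentation tools do (a toy stand-in, not a substitute for jieba, THULAC, or LTP), the following sketch implements forward maximum matching over a hypothetical vocabulary, including the removal of a meaningless particle:

```python
# Toy forward-maximum-matching segmenter; VOCAB and STOPWORDS are
# hypothetical examples, not taken from the patent.
VOCAB = {"秦皇岛", "火车站", "周边", "小吃", "的"}
STOPWORDS = {"的"}  # structural particles removed after segmentation

def segment(text, vocab=VOCAB, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # fall back to a single character if no vocabulary word matches
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return [w for w in words if w not in STOPWORDS]

print(segment("秦皇岛火车站周边的小吃"))
# → ['秦皇岛', '火车站', '周边', '小吃']
```

Real segmenters use statistical models rather than pure longest match, but the output shape — a target word sequence with positions preserved — is the same.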
Step S12: match each target word against the preset dictionary, and obtain the target word's attribute vector from the preset dictionary according to the match result.
An attribute vector characterizes a word's information along multiple dimensions in vector form; the information in each dimension can be converted to a numeric value characterizing one aspect of the corresponding word.
In an exemplary embodiment, the attribute vector may include at least one of the following dimensions: part-of-speech information, word length, word frequency, and left/right entropy information. Left/right entropy information generally comprises left (preceding) entropy and right (following) entropy, which measure, over a given corpus, the uncertainty of the words adjacent to the target word on its left and right; the higher the entropy, the more uncertain the adjacent words. Note that part-of-speech information is originally textual (e.g., the target word is a noun or a verb); it can be converted to a numeric value by a preset conversion rule, e.g., noun = 1, verb = 2. Other non-numeric attribute types can be converted to numeric values in a similar way.
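A minimal sketch of how a word's per-dimension attributes might be packed into such a numeric vector; the POS codes and example values are hypothetical, not from the patent:

```python
# Hypothetical numeric conversion rule for the part-of-speech dimension.
POS_CODES = {"noun": 1, "verb": 2, "adjective": 3}

def attribute_vector(pos, word_len, freq, left_entropy, right_entropy):
    """[POS code, word length, word frequency, left entropy, right entropy]."""
    return [POS_CODES[pos], word_len, freq, left_entropy, right_entropy]

vec = attribute_vector("noun", 3, 120, 2.1, 1.8)
print(vec)  # → [1, 3, 120, 2.1, 1.8]
```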
The preset dictionary can be regarded as a word database containing the attribute vectors of a set of words. Initially, the preset dictionary can be built by labeling or computation: for a given corpus, a segmentation tool can segment its texts, each word can be tagged with its part of speech, each word's length and frequency can be counted, and each word's adjacent-word statistics can be collected to compute its left and right entropy, yielding each word's attribute vector and forming the initial preset dictionary. During use, new words, or changed attribute values for existing words, can also be added to the preset dictionary as needed.
If a target word is in the preset dictionary, its attribute vector can be matched directly from the preset dictionary; if the target word is not in the preset dictionary, the following step S13 is executed.
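The patent does not give a formula for the entropy dimensions, but one common reading — Shannon entropy over the characters adjacent to the word in the corpus — can be sketched as follows for the right side (left entropy would mirror it over preceding characters):

```python
import math
from collections import Counter

def right_entropy(word, corpus_sentences):
    """Shannon entropy (bits) of the characters immediately following `word`.

    A possible realization of the 'right entropy' dimension; the exact
    definition used by the patent is an assumption here.
    """
    neighbors = Counter()
    for s in corpus_sentences:
        start = 0
        while (i := s.find(word, start)) != -1:
            j = i + len(word)
            if j < len(s):
                neighbors[s[j]] += 1
            start = i + 1
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbors.values())

# toy corpus: "ab" is followed by 'c' twice and 'd' once
print(round(right_entropy("ab", ["abc", "abd", "abc"]), 3))  # → 0.918
```

Higher entropy means the following character is less predictable, matching the intuition in the text that high left/right entropy indicates uncertain neighbors.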
Step S13: if a target word is an out-of-vocabulary word that fails to match the preset dictionary, obtain multiple similar words for it from the preset dictionary, and compute the attribute vector of the out-of-vocabulary word from the similar words' attribute vectors.
Similar words can be semantic near-synonyms and can be obtained in several ways, such as looking up near-synonyms of the out-of-vocabulary word in a Chinese thesaurus and selecting those already present in the preset dictionary as its similar words, or building a near-synonym library from the corpus of the application scenario and searching it for words present in the preset dictionary. Beyond near-synonymy, additional requirements can be imposed on similar words, such as having the same word length or the same part of speech as the out-of-vocabulary word. This embodiment places no particular restriction on how similar words are obtained.
From the multiple similar words obtained, their attribute vectors can be aggregated: the mean, median, mode, etc. of each dimension can be computed and used as the corresponding dimension value of the out-of-vocabulary word's attribute vector. This embodiment likewise places no particular restriction on how the out-of-vocabulary word's attribute vector is computed.
Step S14: convert the target word sequence into an attribute vector sequence according to each target word's attribute vector, and process the attribute vector sequence with a machine learning model to obtain the entity recognition result for each target word in the text to be recognized.
The attribute vector sequence may be the attribute vectors of the target words arranged in their order in the text, or may take the form of a matrix. Taking the matrix form as an example, the machine learning model's input can be set as an m*n matrix, where m is the dimension of the attribute vector and n is the preset text length. If the text to be recognized is shorter than n, i.e., it contains fewer than n target words, default padding can be applied, e.g., filling the missing columns of the matrix with 0. The model's output can be a p*n matrix, where p is the number of classes, i.e., the total number of candidate entity classes for a target word.
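The zero-padding just described can be sketched as follows, assuming the column-per-word layout (m rows of attribute dimensions, n columns of words); names are illustrative:

```python
def to_input_matrix(attr_vectors, n, pad_value=0.0):
    """Stack attribute vectors into an m x n matrix (m = vector dimension,
    n = preset text length), padding missing columns with pad_value."""
    m = len(attr_vectors[0])
    cols = attr_vectors + [[pad_value] * m] * (n - len(attr_vectors))
    # transpose: one row per attribute dimension, one column per word
    return [[cols[j][i] for j in range(n)] for i in range(m)]

# two words with 2-dimensional attribute vectors, padded to length n = 3
mat = to_input_matrix([[1, 2], [3, 4]], n=3)
print(mat)  # → [[1, 3, 0.0], [2, 4, 0.0]]
```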
It should be understood that in practice the input matrix may instead be an n*m matrix, in which case the corresponding output matrix may be n*p, with m, n, and p as defined above. Many machine learning models can process input data in matrix or multi-vector form, such as conditional random field models and neural network models; this embodiment places no particular restriction here.
In step S14, the machine learning model is a trained model. Training may proceed as follows: based on a corpus, its texts can be segmented, all resulting words taken as sample words, and each sample word's attribute vector and entity class label obtained by counting or labeling. With the attribute vector sequences of the sample word sequences as input and entity class matrices as output, the machine learning model is trained, adjusting its parameters according to the error between the model output and the labeled entity classes until the model reaches a certain accuracy, which completes training. The above process is, of course, only illustrative; in practice, the training process may involve more steps, and the details of each step may differ from the above. This embodiment places no particular restriction on the specific manner and steps of training the machine learning model.
Based on the above, in the present exemplary embodiment, the attribute vector of each target word in the text to be recognized is obtained either by matching against the preset dictionary or by computation from similar words; the text is converted into an attribute vector sequence and processed by a machine learning model to produce each target word's entity recognition result. On the one hand, an out-of-vocabulary word's attribute vector can be computed from already-labeled similar words, with no dedicated labeling needed, reducing the workload of manually labeling text data and saving labor cost. On the other hand, entity recognition is performed on the basis of the target word's attribute vector, which can carry information along multiple dimensions, increasing the accuracy of entity recognition.
In an exemplary embodiment, obtaining multiple similar words for the out-of-vocabulary word from the preset dictionary can be realized by the following steps:
converting the out-of-vocabulary word into a target word vector;
searching the preset dictionary for words whose word vectors are close to the target word vector, as the similar words of the out-of-vocabulary word.
The target word vector conversion can be realized with the Word2vec tool. Note that the target word vector identifies the target word by features along multiple dimensions; it is not a one-hot vector, and it is a different vector from the attribute vector. The dimensions of the target word vector carry no required specific meaning, whereas each dimension of the attribute vector represents one attribute of the target word. In this embodiment, the preset dictionary also contains the word vector of each of its words, so the out-of-vocabulary word's similar words can be found by word-vector proximity. A proximity rule can be preset, e.g., judging two word vectors close if each dimension differs by no more than 10%, or if the norm of their difference vector is below some value.
Further, searching the preset dictionary for words whose word vectors are close to the target word vector can be realized as follows: traverse the preset dictionary and find all words whose word vectors have a cosine similarity with the target word vector greater than a similarity threshold. Cosine similarity usually lies in the range [-1, 1], with larger values indicating closer vectors. The similarity threshold can be given an empirical initial value and adjusted iteratively in actual use to reach a better value that filters similar words more effectively.
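A minimal brute-force version of that traversal, with hypothetical 2-dimensional word vectors standing in for real Word2vec output:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similar_words(target_vec, dictionary, threshold=0.8):
    """Traverse the dictionary, keeping words whose word vectors have
    cosine similarity with the target vector above the threshold."""
    return [w for w, vec in dictionary.items()
            if cosine(target_vec, vec) > threshold]

# hypothetical word vectors for a few preset-dictionary entries
dictionary = {"北戴河": [1.0, 0.9], "石家庄": [0.9, 1.0], "小吃": [-1.0, 0.2]}
print(similar_words([1.0, 1.0], dictionary))  # → ['北戴河', '石家庄']
```

In practice, real word vectors have hundreds of dimensions and the scan can be replaced by an approximate nearest-neighbor index, but the threshold logic is the same.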
Alternatively, the number of similar words can be preset, e.g., to M; when searching, the M words whose word vectors have the highest cosine similarity with the out-of-vocabulary word's word vector can be selected to compute the out-of-vocabulary word's attribute vector.
In an exemplary embodiment, after the out-of-vocabulary word's attribute vector has been computed from its similar words, the out-of-vocabulary word and its attribute vector can also be added to the preset dictionary, thereby updating the preset dictionary for subsequent use.
In an exemplary embodiment, computing the out-of-vocabulary word's attribute vector from the similar words' attribute vectors can be realized by averaging the similar words' attribute vectors to obtain the out-of-vocabulary word's attribute vector. The average may be arithmetic or weighted.
For example, suppose out-of-vocabulary word x has 3 similar words a, b, c, with attribute vectors (a1, a2, a3), (b1, b2, b3), and (c1, c2, c3) respectively. With the arithmetic mean, the attribute vector of x can be computed by:
x = (x1, x2, x3) = 1/3 * (a1+b1+c1, a2+b2+c2, a3+b3+c3);
With a weighted average, the attribute vector of x can be computed by:
x = (x1, x2, x3) = 1/(csa+csb+csc) * (csa*a1+csb*b1+csc*c1, csa*a2+csb*b2+csc*c2, csa*a3+csb*b3+csc*c3);
where csa = cos(x, a) is the cosine similarity between the word vectors of x and a, and csb and csc are likewise the word-vector cosine similarities of x with b and of x with c; that is, the cosine similarities serve as the weights of the weighted average. Alternatively, each similar word's word frequency or another attribute value can serve as the weight; this embodiment places no particular restriction here.
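The weighted average just described can be sketched as follows (the weights stand in for the cosine similarities; names are illustrative):

```python
def weighted_average_attr(similar, weights):
    """Attribute vector of an out-of-vocabulary word as the weighted mean
    of its similar words' attribute vectors (weights e.g. cosine
    similarities, word frequencies, or another attribute value)."""
    total = sum(weights)
    dims = len(similar[0])
    return [sum(w * vec[i] for w, vec in zip(weights, similar)) / total
            for i in range(dims)]

a, b, c = [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]
# equal weights reduce to the arithmetic mean
print(weighted_average_attr([a, b, c], [1.0, 1.0, 1.0]))  # → [3.0, 4.0]
```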
Averaging effectively smooths the similar words' attribute vectors. Fig. 2 shows an exemplary flowchart of a text entity recognition method: after the text to be recognized is obtained, it is segmented and attribute vectors are extracted for the target words; for out-of-vocabulary words, attribute vectors can be obtained by smoothing over similar words' attributes; finally, the attribute vector sequence is input into the machine learning model, which outputs the corresponding entity recognition result.
In an exemplary embodiment, the machine learning model may include a conditional random field model or a neural network model. A conditional random field model computes the probabilistic relationship between the text to be recognized and entity class sequences, and can take the highest-probability entity class sequence as the entity recognition result; a neural network model takes the attribute vector sequence of the text to be recognized as input, converts it into probabilities over the entity classes, and determines the final entity recognition result by methods such as softmax.
In the application scenario where a user searches for a hotel, the probability formula of the conditional random field model can be as described below:
where y is the entity class, x the attribute vector, Z(x) the normalizing factor, t_k the transition feature functions, s_l the state feature functions, and λ_k and μ_l the model parameters; 'B-merchant-name' denotes a beginning-class merchant name and 'E-merchant-name' an ending-class merchant name.
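The equation itself did not survive extraction of this document; given the variable definitions (transition features t_k, state features s_l, parameters λ_k and μ_l, normalizer Z(x)), it is presumably the standard linear-chain CRF distribution, reconstructed here as a sketch:

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left(
  \sum_{i,k} \lambda_k \, t_k(y_{i-1}, y_i, x, i)
  + \sum_{i,l} \mu_l \, s_l(y_i, x, i) \right)
```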
In an exemplary embodiment, the probability formula of the conditional random field model can be improved to some extent:
where y is the entity class, x the attribute vector, Z(x) the normalizing factor, and V_k the feature functions of the conditional random field model; w_k is the weight coefficient of a word present in the preset dictionary and f_k the transition feature function; M is the number of similar words, w_k' the weight coefficient of each similar word, and sim the similarity function between a similar word and the out-of-vocabulary word (which can be, e.g., cosine similarity). The improved conditional random field model can thus handle out-of-vocabulary words better.
The following description uses the application scenario of a user searching for a restaurant as an example. As shown in Fig. 3, the user enters the text to be recognized, meaning "snacks around Qinhuangdao railway station". To determine the user's intent, the server needs to perform entity recognition on each word of the text. The text is first segmented into the target words "Qinhuangdao", "railway station", "surrounding", and "snacks"; the attribute vector of each target word is then obtained via the preset dictionary. "Railway station", "surrounding", and "snacks" exist in the preset dictionary and their attribute vectors can be matched directly, while "Qinhuangdao" is an out-of-vocabulary word, for which several similar words can be found in the preset dictionary, such as "Beidaihe", "Shijiazhuang", and "Lianyungang"; its attribute vector is computed from these similar words' attribute vectors, converting the text into an attribute vector sequence. Inputting the attribute vector sequence into the trained neural network model yields a result matrix in which each column represents one target word's candidate classes; the recognized class can be '1' and the remaining classes '0'. In Fig. 3, 'B-landmark' denotes a beginning-class landmark, 'M-landmark' a middle landmark, and 'E-landmark' an ending-class landmark. The final entity recognition result for each target word is: "Qinhuangdao" = 'B-landmark', "railway station" = 'E-landmark', "surrounding" = 'meaningless', "snacks" = 'category'.
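The dictionary-match-with-fallback flow of this example can be sketched end to end with toy attribute values (all vectors and dictionary contents hypothetical):

```python
# Hypothetical preset dictionary: word → attribute vector
# ([POS code, word length, word frequency] as a toy 3-dim example).
PRESET = {
    "火车站": [1, 3, 80],   # railway station
    "周边":   [1, 2, 60],   # surrounding
    "小吃":   [1, 2, 90],   # snacks
    "北戴河": [1, 3, 40],   # Beidaihe (similar word)
    "石家庄": [1, 3, 50],   # Shijiazhuang (similar word)
}

def lookup_attr(word, similar_of):
    """Dictionary match first; fall back to averaging similar words'
    attribute vectors for out-of-vocabulary words."""
    if word in PRESET:
        return PRESET[word]
    sims = [PRESET[w] for w in similar_of[word]]
    n = len(sims)
    return [sum(col) / n for col in zip(*sims)]

# similar words for the OOV word, as found via word-vector search
similar_of = {"秦皇岛": ["北戴河", "石家庄"]}
seq = [lookup_attr(w, similar_of) for w in ["秦皇岛", "火车站", "周边", "小吃"]]
print(seq[0])  # → [1.0, 3.0, 45.0]
```

The resulting attribute vector sequence `seq` is what would then be padded into the model's input matrix.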
The exemplary embodiment of the disclosure additionally provides a kind of text entities identification device, refering to what is shown in Fig. 4, the device 40 It may include: word segmentation processing module 41, for segmenting text to be identified, obtain the target word order of multiple target word compositions Column;Feature obtains module 42, for matching each target word with default dictionary, and according to matched result from default dictionary The middle attribute vector for obtaining target word;Similar processing module 43, for there are unregistered words and default dictionary in target word When with failure, the similar word of multiple unregistered words is obtained from default dictionary, and do not step on according to the calculating of the attribute vector of similar word Record the attribute vector of word;Target word sequence is converted to category for the attribute vector according to each target word by series processing module 44 Property sequence vector, and attribute vector sequence is handled by machine learning model, obtain each target word in text to be identified Entity recognition result.
In one exemplary embodiment, similar processing module can be also used for being converted to unregistered word into target term vector, And it is searched from default dictionary with word corresponding to term vector similar in target term vector, the similar word as unregistered word.
In one exemplary embodiment, similar processing module can be also used for traversing default dictionary, search all and target Word corresponding to term vector of the cosine similarity of term vector less than a similar threshold value.
In one exemplary embodiment, text entities identification device can also include: dictionary management module, for that will not step on The attribute vector of record word and unregistered word is added to default dictionary.
In one exemplary embodiment, similar processing module can be also used for carrying out asking flat to the attribute vector of each similar word It calculates, to obtain the attribute vector of unregistered word.
In one exemplary embodiment, machine learning model may include conditional random field models or neural network model.
In an exemplary embodiment, the attribute vector may include at least one of the following dimensions: part-of-speech information, word length, word frequency, and left/right entropy information.
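A toy construction of such an attribute vector is sketched below. Left/right entropy here means the Shannon entropy of the words appearing immediately to the left/right of the target word in a corpus, a common boundary-independence feature in Chinese word processing; the exact feature encoding is an assumption, not specified by the patent:

```python
import math
from collections import Counter

def neighbor_entropy(word, corpus_words, side="left"):
    # Shannon entropy of the words directly adjacent to `word` on one
    # side; higher entropy suggests a more independent word boundary.
    neighbors = Counter()
    for i, w in enumerate(corpus_words):
        if w == word:
            j = i - 1 if side == "left" else i + 1
            if 0 <= j < len(corpus_words):
                neighbors[corpus_words[j]] += 1
    total = sum(neighbors.values())
    return -sum((c / total) * math.log2(c / total)
                for c in neighbors.values())

def attribute_vector(word, corpus_words, pos_id):
    # A toy attribute vector: [part-of-speech id, word length,
    # word frequency, left entropy, right entropy].
    freq = corpus_words.count(word)
    return [pos_id, len(word), freq,
            neighbor_entropy(word, corpus_words, "left"),
            neighbor_entropy(word, corpus_words, "right")]

corpus = ["the", "hotel", "is", "a", "hotel", "near", "the", "river"]
vec = attribute_vector("hotel", corpus, pos_id=1)
# "hotel" occurs twice with two distinct left and right neighbors,
# so both entropies are 1 bit
```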
The details of each of the above modules have been described in the embodiments of the method part, and are therefore not repeated here.
Exemplary embodiments of the present disclosure further provide an electronic device capable of implementing the above method.
Those skilled in the art will understand that various aspects of the present disclosure may be implemented as a system, a method, or a program product. Accordingly, various aspects of the present disclosure may take the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may be collectively referred to herein as a "circuit", "module", or "system".
An electronic device 500 according to this exemplary embodiment of the present disclosure is described below with reference to Fig. 5. The electronic device 500 shown in Fig. 5 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 5, the electronic device 500 takes the form of a general-purpose computing device. Its components may include, but are not limited to: at least one processing unit 510, at least one storage unit 520, a bus 530 connecting different system components (including the storage unit 520 and the processing unit 510), and a display unit 540.
The storage unit stores program code that can be executed by the processing unit 510, so that the processing unit 510 performs the steps described in the "Exemplary Methods" section of this specification according to the various exemplary embodiments of the present disclosure. For example, the processing unit 510 may perform steps S11 to S14 shown in Fig. 1.
The storage unit 520 may include a readable medium in the form of a volatile storage unit, such as a random access memory (RAM) unit 521 and/or a cache unit 522, and may further include a read-only memory (ROM) unit 523.
The storage unit 520 may also include a program/utility 524 having a set of (at least one) program modules 525. Such program modules 525 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
The bus 530 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 500 to communicate with one or more other computing devices. Such communication may take place via an input/output (I/O) interface 550. Moreover, the electronic device 500 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 via the bus 530. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Through the above description of the embodiments, those skilled in the art will readily appreciate that the exemplary embodiments described herein may be implemented in software, or in software combined with necessary hardware. Accordingly, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes instructions that cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the exemplary embodiments of the present disclosure.
Exemplary embodiments of the present disclosure further provide a computer-readable storage medium on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the present disclosure may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps described in the "Exemplary Methods" section of this specification according to the various exemplary embodiments of the present disclosure.
Referring to Fig. 6, a program product 600 for implementing the above method according to an exemplary embodiment of the present disclosure is described. It may employ a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto. In this document, a readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
Program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above drawings are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of the present disclosure, and are not intended for limitation. It is easy to understand that the processing shown in the drawings does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
It should be noted that although several modules or units of a device for performing actions are mentioned in the above detailed description, such division is not mandatory. In fact, according to the exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied in multiple modules or units.
Those skilled in the art will readily conceive of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and examples are to be regarded as illustrative only, with the true scope and spirit of the present disclosure being indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A text entity recognition method, characterized by comprising:
segmenting a text to be recognized to obtain a target word sequence composed of multiple target words;
matching each target word against a preset dictionary, and obtaining the attribute vector of the target word from the preset dictionary according to the matching result;
if an unregistered word exists among the target words that fails to match the preset dictionary, obtaining multiple similar words of the unregistered word from the preset dictionary, and computing the attribute vector of the unregistered word according to the attribute vectors of the similar words; and
converting the target word sequence into an attribute vector sequence according to the attribute vector of each target word, and processing the attribute vector sequence through a machine learning model to obtain the entity recognition result of each target word in the text to be recognized.
2. The method according to claim 1, wherein obtaining multiple similar words of the unregistered word from the preset dictionary comprises:
converting the unregistered word into a target word vector; and
searching the preset dictionary for words whose word vectors are similar to the target word vector, as the similar words of the unregistered word.
3. The method according to claim 2, wherein searching the preset dictionary for words whose word vectors are similar to the target word vector comprises:
traversing the preset dictionary, and finding all words whose word vectors have a cosine similarity to the target word vector less than a similarity threshold.
4. The method according to claim 1, wherein the method further comprises:
adding the unregistered word and the attribute vector of the unregistered word to the preset dictionary.
5. The method according to claim 1, wherein computing the attribute vector of the unregistered word according to the attribute vectors of the similar words comprises:
averaging the attribute vectors of the similar words to obtain the attribute vector of the unregistered word.
6. The method according to claim 1, wherein the machine learning model comprises a conditional random field model or a neural network model.
7. The method according to claim 1, wherein the attribute vector comprises at least one of the following dimensions: part-of-speech information, word length, word frequency, and left/right entropy information.
8. A text entity recognition device, characterized by comprising:
a word segmentation module, configured to segment a text to be recognized to obtain a target word sequence composed of multiple target words;
a feature acquisition module, configured to match each target word against a preset dictionary and obtain the attribute vector of the target word from the preset dictionary according to the matching result;
a similarity processing module, configured to, when an unregistered word exists among the target words that fails to match the preset dictionary, obtain multiple similar words of the unregistered word from the preset dictionary, and compute the attribute vector of the unregistered word according to the attribute vectors of the similar words; and
a sequence processing module, configured to convert the target word sequence into an attribute vector sequence according to the attribute vector of each target word, and to process the attribute vector sequence through a machine learning model to obtain the entity recognition result of each target word in the text to be recognized.
9. An electronic device, characterized by comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method according to any one of claims 1-7 by executing the executable instructions.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1-7.
CN201810890287.8A 2018-08-07 2018-08-07 Text entity identification method and device, electronic equipment and storage medium Active CN109145294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810890287.8A CN109145294B (en) 2018-08-07 2018-08-07 Text entity identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810890287.8A CN109145294B (en) 2018-08-07 2018-08-07 Text entity identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109145294A true CN109145294A (en) 2019-01-04
CN109145294B CN109145294B (en) 2022-05-24

Family

ID=64792172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810890287.8A Active CN109145294B (en) 2018-08-07 2018-08-07 Text entity identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109145294B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933788A (en) * 2019-02-14 2019-06-25 北京百度网讯科技有限公司 Type determines method, apparatus, equipment and medium
CN110298044A (en) * 2019-07-09 2019-10-01 广东工业大学 A kind of entity-relationship recognition method
CN110413990A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 The configuration method of term vector, device, storage medium, electronic device
CN110413819A (en) * 2019-07-12 2019-11-05 深兰科技(上海)有限公司 A kind of acquisition methods and device of picture description information
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN111104500A (en) * 2019-12-21 2020-05-05 江西省天轴通讯有限公司 Cable matching method, system, readable storage medium and computer equipment
CN111611775A (en) * 2020-05-14 2020-09-01 沈阳东软熙康医疗***有限公司 Entity identification model generation method, entity identification method, device and equipment
CN111666963A (en) * 2019-03-07 2020-09-15 深圳云天励飞技术有限公司 Clothing style identification method, device and equipment
CN111737998A (en) * 2020-06-23 2020-10-02 北京字节跳动网络技术有限公司 Dialect text generation method and device, storage medium and electronic equipment
CN111950277A (en) * 2019-04-30 2020-11-17 中移(苏州)软件技术有限公司 Business situation entity determining method, device and storage medium
CN112307766A (en) * 2020-09-22 2021-02-02 北京京东世纪贸易有限公司 Method, apparatus, electronic device and medium for identifying preset category entities
CN112528666A (en) * 2019-08-30 2021-03-19 北京猎户星空科技有限公司 Semantic recognition method and device and electronic equipment
CN112559732A (en) * 2019-09-25 2021-03-26 阿里巴巴集团控股有限公司 Text processing method, device and system
CN112686035A (en) * 2019-10-18 2021-04-20 北京沃东天骏信息技术有限公司 Method and device for vectorizing unknown words
CN113255326A (en) * 2021-05-18 2021-08-13 北京明略昭辉科技有限公司 Unknown word vector calculation method, system, electronic device and storage medium
CN113569554A (en) * 2021-09-24 2021-10-29 北京明略软件***有限公司 Entity pair matching method and device in database, electronic equipment and storage medium
CN113627180A (en) * 2020-05-09 2021-11-09 北京沃东天骏信息技术有限公司 Method, device, electronic equipment and storage medium for determining type of article
CN114510658A (en) * 2022-02-10 2022-05-17 上海钧正网络科技有限公司 Determination method and device for recommended boarding point, server and terminal equipment
CN114943222A (en) * 2022-05-13 2022-08-26 医渡云(北京)技术有限公司 Time entity identification method and device, computer storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678272A (en) * 2012-09-17 2014-03-26 北京信息科技大学 Method for processing unknown words in Chinese-language dependency tree banks
US20170004208A1 (en) * 2015-07-04 2017-01-05 Accenture Global Solutions Limited Generating a domain ontology using word embeddings
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678272A (en) * 2012-09-17 2014-03-26 北京信息科技大学 Method for processing unknown words in Chinese-language dependency tree banks
US20170004208A1 (en) * 2015-07-04 2017-01-05 Accenture Global Solutions Limited Generating a domain ontology using word embeddings
CN106557462A (en) * 2016-11-02 2017-04-05 数库(上海)科技有限公司 Name entity recognition method and system
CN108038103A (en) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 A kind of method, apparatus segmented to text sequence and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI AIQIN ET AL.: "Entity Linking Method Based on Word Vectors (基于词向量的实体链接方法)", 《数据采集与处理》 (Journal of Data Acquisition and Processing) *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933788B (en) * 2019-02-14 2023-05-23 北京百度网讯科技有限公司 Type determining method, device, equipment and medium
CN109933788A (en) * 2019-02-14 2019-06-25 北京百度网讯科技有限公司 Type determines method, apparatus, equipment and medium
CN111666963A (en) * 2019-03-07 2020-09-15 深圳云天励飞技术有限公司 Clothing style identification method, device and equipment
CN111666963B (en) * 2019-03-07 2023-10-13 深圳云天励飞技术有限公司 Method, device and equipment for identifying clothes styles
CN111950277A (en) * 2019-04-30 2020-11-17 中移(苏州)软件技术有限公司 Business situation entity determining method, device and storage medium
CN110413990A (en) * 2019-06-20 2019-11-05 平安科技(深圳)有限公司 The configuration method of term vector, device, storage medium, electronic device
CN110298044A (en) * 2019-07-09 2019-10-01 广东工业大学 A kind of entity-relationship recognition method
CN110298044B (en) * 2019-07-09 2023-04-18 广东工业大学 Entity relationship identification method
CN110413819A (en) * 2019-07-12 2019-11-05 深兰科技(上海)有限公司 A kind of acquisition methods and device of picture description information
CN110516241A (en) * 2019-08-26 2019-11-29 北京三快在线科技有限公司 Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment
CN110516241B (en) * 2019-08-26 2021-03-02 北京三快在线科技有限公司 Geographic address resolution method and device, readable storage medium and electronic equipment
CN112528666A (en) * 2019-08-30 2021-03-19 北京猎户星空科技有限公司 Semantic recognition method and device and electronic equipment
CN112559732A (en) * 2019-09-25 2021-03-26 阿里巴巴集团控股有限公司 Text processing method, device and system
CN112686035A (en) * 2019-10-18 2021-04-20 北京沃东天骏信息技术有限公司 Method and device for vectorizing unknown words
CN111104500A (en) * 2019-12-21 2020-05-05 江西省天轴通讯有限公司 Cable matching method, system, readable storage medium and computer equipment
CN113627180A (en) * 2020-05-09 2021-11-09 北京沃东天骏信息技术有限公司 Method, device, electronic equipment and storage medium for determining type of article
CN111611775A (en) * 2020-05-14 2020-09-01 沈阳东软熙康医疗***有限公司 Entity identification model generation method, entity identification method, device and equipment
CN111611775B (en) * 2020-05-14 2023-07-18 沈阳东软熙康医疗***有限公司 Entity identification model generation method, entity identification device and equipment
CN111737998A (en) * 2020-06-23 2020-10-02 北京字节跳动网络技术有限公司 Dialect text generation method and device, storage medium and electronic equipment
CN112307766A (en) * 2020-09-22 2021-02-02 北京京东世纪贸易有限公司 Method, apparatus, electronic device and medium for identifying preset category entities
CN113255326A (en) * 2021-05-18 2021-08-13 北京明略昭辉科技有限公司 Unknown word vector calculation method, system, electronic device and storage medium
CN113569554A (en) * 2021-09-24 2021-10-29 北京明略软件***有限公司 Entity pair matching method and device in database, electronic equipment and storage medium
CN113569554B (en) * 2021-09-24 2021-12-28 北京明略软件***有限公司 Entity pair matching method and device in database, electronic equipment and storage medium
CN114510658A (en) * 2022-02-10 2022-05-17 上海钧正网络科技有限公司 Determination method and device for recommended boarding point, server and terminal equipment
CN114943222A (en) * 2022-05-13 2022-08-26 医渡云(北京)技术有限公司 Time entity identification method and device, computer storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109145294B (en) 2022-05-24

Similar Documents

Publication Publication Date Title
CN109145294A (en) Text entities recognition methods and device, electronic equipment, storage medium
CN112668671B (en) Method and device for acquiring pre-training model
KR102302367B1 (en) Method and device for judging validity of point of interest based on internet text mining
CN108920467A (en) Polysemant lexical study method and device, search result display methods
CN110377903B (en) Sentence-level entity and relation combined extraction method
CN108846077A (en) Semantic matching method, device, medium and the electronic equipment of question and answer text
CN109726396A (en) Semantic matching method, device, medium and the electronic equipment of question and answer text
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
JP7417679B2 (en) Information extraction methods, devices, electronic devices and storage media
CN110347908B (en) Voice shopping method, device, medium and electronic equipment
CN114330354B (en) Event extraction method and device based on vocabulary enhancement and storage medium
CN111177186A (en) Question retrieval-based single sentence intention identification method, device and system
CN115516447A (en) Hot news intention identification method, device and equipment and readable storage medium
CN108268629B (en) Image description method and device based on keywords, equipment and medium
KR20220042838A (en) Systems and Methods to Provide AI Answering Service for Legal Questions
WO2024114186A1 (en) Intent recognition method and related device
Li et al. Intention understanding in human–robot interaction based on visual-NLP semantics
CN111368555A (en) Data identification method and device, storage medium and electronic equipment
CN110555093B (en) Text matching method, device and equipment
US20200110834A1 (en) Dynamic Linguistic Assessment and Measurement
CN115269768A (en) Element text processing method and device, electronic equipment and storage medium
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN112100360A (en) Dialog response method, device and system based on vector retrieval
CN117076636A (en) Information query method, system and equipment for intelligent customer service
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant