WO2021146831A1 - Method and apparatus for entity recognition, method for building a dictionary, device, and medium - Google Patents

Method and apparatus for entity recognition, method for building a dictionary, device, and medium Download PDF

Info

Publication number
WO2021146831A1
WO2021146831A1 (PCT/CN2020/073155; CN2020073155W)
Authority
WO
WIPO (PCT)
Prior art keywords
word
speech classification
target text
function
score
Prior art date
Application number
PCT/CN2020/073155
Other languages
English (en)
French (fr)
Inventor
代亚菲
Original Assignee
BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co., Ltd. (京东方科技集团股份有限公司)
Priority to US17/299,023 priority Critical patent/US20220318509A1/en
Priority to EP20891421.8A priority patent/EP4095738A4/en
Priority to CN202080000047.1A priority patent/CN113632092A/zh
Priority to PCT/CN2020/073155 priority patent/WO2021146831A1/zh
Publication of WO2021146831A1 publication Critical patent/WO2021146831A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 - Named entity recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, in particular to a method and device for named entity recognition, a method for establishing a named entity dictionary, electronic equipment, and computer-readable media.
  • the embodiments of the present disclosure provide a method and device for named entity recognition, a method for establishing a named entity dictionary, electronic equipment, and computer-readable media.
  • embodiments of the present disclosure provide a method for named entity recognition, including:
  • the conditional random field model includes a plurality of scoring functions, and the scoring functions include a sememe function (sememe: 义原, the smallest semantic unit; rendered "semantic source" in some translations) and at least one template function;
  • each of the template functions is used to give the score of each character in the target text corresponding to each part-of-speech classification;
  • the sememe function is used to match at least part of the words in the target text against the sememes in a preset sememe library, and when a word among the at least part of the words has a matching sememe, each character of that word is given a score for the part-of-speech classification corresponding to the type attribute of the sememe in the sememe library;
  • the conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text corresponding to each part-of-speech classification, where the total score of any character corresponding to any part-of-speech classification is the sum of the scores given to that character for that classification by all the scoring functions;
  • the named entity in the target text is determined according to the part-of-speech classification of the word to be classified.
  • each of the template functions has a feature template and a plurality of feature functions; each feature template specifies an extraction position having a certain positional relationship with the current position, and the feature templates of any two template functions are different;
  • each of the template functions is used to place each character in the target text at the current position in turn; when any character is at the current position, each feature function is used to judge whether the character at the extraction position matches the preset text specified by that feature function, and if it matches, a preset score corresponding to a preset part-of-speech classification is given to the character at the current position.
  • the extraction position specified by each of the feature templates is any one of the following:
  • C-2, C-1, C0, C1, C2, C-2C-1, C-1C0, C0C1, C1C2, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3;
  • where Cn represents the position n characters after the current position,
  • C-n represents the position n characters before the current position,
  • and n is any one of 0, 1, 2, and 3.
  • the conditional random field model further includes a scoring function for determining whether a character is punctuation, a scoring function for numbers, a scoring function for letters, a scoring function for the beginning of a sentence, a scoring function for the end of a sentence, a scoring function for whether a character is a common suffix, and a scoring function for whether a character is a common prefix.
  • the determining of the part-of-speech classification of each character according to the total score of each character in the target text corresponding to each part-of-speech classification includes:
  • the part-of-speech classification corresponding to the maximum total score is used as the part-of-speech classification of the character.
  • the conditional random field model further includes a word segmentation function, and the word segmentation function is used to segment a plurality of words to be matched from the target text;
  • the sememe function is used to match each of the words to be matched against the sememes in the preset sememe library, and when a word to be matched has a matching sememe, each character of that word is given the score of the part-of-speech classification corresponding to the type attribute of the sememe in the sememe library.
  • the determining the named entity in the target text according to the part-of-speech classification of the word to be classified includes:
  • the words to be classified in the part-of-speech classification of the predetermined field are extracted as named entities.
  • the predetermined field is a medical field.
  • the obtaining the target text includes:
  • the method further includes:
  • the method before the determining the word to be classified and the part-of-speech classification of the word to be classified in the target text according to the preset conditional random field model, the method further includes:
  • the conditional random field model is trained using training text.
  • embodiments of the present disclosure provide a method for establishing a named entity dictionary, including:
  • a named entity dictionary is established.
  • a named entity recognition device which includes:
  • an acquisition module, configured to obtain the target text;
  • the classification module is configured to determine the words to be classified in the target text and their part-of-speech classifications according to a preset conditional random field model; wherein the conditional random field model includes a plurality of scoring functions, and the scoring functions include a sememe function and at least one template function; each of the template functions is used to give the score of each character in the target text corresponding to each part-of-speech classification; the sememe function is used to match at least part of the words in the target text against the sememes in the preset sememe library, and when a word among the at least part of the words has a matching sememe, each character of that word is given the score of the part-of-speech classification corresponding to the type attribute of the sememe in the sememe library;
  • the conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text corresponding to each part-of-speech classification, where the total score of any character corresponding to any part-of-speech classification is the sum of the scores given by all the scoring functions.
  • the determining module is configured to determine the named entity in the target text according to the part-of-speech classification of the word to be classified.
  • the determining module is configured to extract a word to be classified in a part-of-speech classification of a predetermined field as a named entity.
  • an electronic device which includes:
  • one or more processors;
  • a memory having one or more programs stored thereon, where, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement any one of the aforementioned named entity recognition methods.
  • embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, and when the program is executed by a processor, any one of the aforementioned named entity recognition methods is implemented.
  • FIG. 1 is a flowchart of a method for named entity recognition provided by an embodiment of the disclosure
  • FIG. 3 is a flowchart of a method for establishing a named entity dictionary provided by an embodiment of the present disclosure
  • FIG. 4 is a block diagram of a device for recognizing named entities provided by an embodiment of the disclosure.
  • FIG. 5 is a block diagram of an electronic device provided by an embodiment of the disclosure.
  • FIG. 6 is a block diagram of another electronic device provided by an embodiment of the disclosure.
  • FIG. 7 is a block diagram of a computer-readable medium provided by an embodiment of the present disclosure.
  • The method and device for recognizing named entities, the method for establishing a named entity dictionary, the electronic equipment, and the computer-readable medium provided by the embodiments of the present disclosure are described in detail below.
  • "Text", in linguistics, is a collection of content that can express a certain meaning; it is mainly composed of characters (including punctuation and other symbols). For example, a sentence, an article, a book, or a text webpage can each be a "text".
  • "Character" (字) is the most basic unit that constitutes a text; specifically, it can be a Chinese character, a letter, a number, a symbol, punctuation, etc.
  • "Word" (词) refers to a character set composed of one character or a plurality of consecutive characters, which can express a relatively independent meaning in linguistics.
  • "Sememe" (义原, rendered "yiyuan" or "semantic source" in some translations), in linguistics, is the smallest semantic unit that can independently express a certain meaning: the most basic, least divisible, most fine-grained "word".
  • "Person", for example, is a very complex concept, a collection of multiple attributes, but it can also be regarded as a sememe.
  • All concepts can be decomposed into various sememes, and a limited set of sememes (a sememe library) can thus be obtained.
  • Many sememes are also "named entities".
  • The relationship between different "words (named entities)" can be understood as a tree structure, in which the final branches (leaves) cannot be further divided and are sememes.
  • The medical field refers to the range of all content that is strongly relevant to medical technology. More specifically, it can include content related to diseases (such as disease types, causes, symptoms, treatments, etc.), content related to treatment (such as treatment methods, treatment equipment, treatment drugs, etc.), content related to prevention/exercise (such as prevention/exercise methods, equipment, drugs, etc.), and content related to specific medical concepts (such as doctors, medical organizations, medical history, etc.).
  • A knowledge graph is a collection of data representing the relationships between different entities and the attributes of entities. In the knowledge graph, entities are nodes; entities, their corresponding attributes, and the attributes' corresponding values are connected by edges to form a structured, network-like database.
  • the embodiments of the present disclosure provide a method for named entity recognition.
  • the method of the embodiment of the present disclosure can be used to identify at least part of the named entities in the text and determine the part-of-speech classification of these named entities.
  • the method for recognizing a named entity in an embodiment of the present disclosure may specifically include the following steps S101 to S103.
  • the target text for input can be obtained through input devices such as a mouse, a keyboard, a voice input unit, a text recognition unit, and a scanning unit; or, the target text can be directly read from a storage medium such as a reserved hard disk.
  • this step (S101) may specifically include: crawling the target text from the network.
  • the target text can be crawled from the content of the network (for example, a predetermined range of the network).
  • A specific target text can be obtained by crawling; for example, the content of an encyclopedia knowledge entry named after a "disease name" can be crawled and used as the target text.
  • Most of the content of an encyclopedia knowledge entry named after a "disease name" is related to the disease of that name, such as aliases of the disease, causes of the disease, and treatment methods.
  • the content of the target text can be "semi-structured", and the target text can be divided into multiple parts (such as paragraphs, chapters, etc.) according to subheadings such as aliases, etiology, and treatment methods.
  • target text may also be unstructured data (for example, an article without a subtitle).
  • the target texts obtained in this step (S101) are target texts belonging to the same predetermined field.
  • the target text obtained from a specific source can be limited to ensure that the obtained target text belongs to the same predetermined field.
  • For example, when the content of the encyclopedia knowledge of a certain website is crawled as the target text, the crawl can be restricted to the content of "medical encyclopedia knowledge" (as determined by the classification label the website assigns to each encyclopedia entry), so that only target text in the medical field is crawled.
  • For example, the urllib module in the Python language can be used to fetch web pages, and the Beautiful Soup framework can be used to extract the corresponding text content from each web page (in formats such as HTML or XML) and save it (for example, the content of each webpage saved as one file) as the target text.
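The fetch-and-extract step described above can be sketched as follows. The text names urllib and Beautiful Soup; to keep this hypothetical sketch self-contained, it extracts visible text with the standard-library html.parser instead, and the page content is hard-coded rather than fetched from the network.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text of a web page, skipping script/style blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def page_to_target_text(html):
    """One saved web page -> one target-text string."""
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

html = ("<html><head><style>p{}</style></head>"
        "<body><h1>Chickenpox</h1><p>A common disease.</p></body></html>")
print(page_to_target_text(html))  # -> "Chickenpox\nA common disease."
```

In a real pipeline the `html` string would come from `urllib.request.urlopen`, and each page's extracted text would be saved as one file, as the paragraph above describes.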
  • S102 Determine the word to be classified in the target text and its part-of-speech classification according to a preset conditional random field model.
  • The conditional random field model includes multiple scoring functions; the scoring functions include a sememe function and at least one template function. Each template function is used to give each character in the target text a score for each part-of-speech classification. The sememe function is used to match at least part of the words in the target text against the sememes in a preset sememe library; when a word has a matching sememe, each character of the word is given a score for the part-of-speech classification corresponding to the type attribute of the sememe in the sememe library.
  • The conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text corresponding to each part-of-speech classification, where the total score of any character corresponding to any part-of-speech classification is the sum of the scores given to that character for that classification by all the scoring functions. The conditional random field model is also used to determine multiple consecutive characters with the same part-of-speech classification as one word to be classified.
  • In this step, a preset conditional random field (CRF, Conditional Random Field) model is used to process the target text to determine the score of each character in the target text corresponding to each part-of-speech classification (which can reflect the probability that the character belongs to each classification), and the part-of-speech classification of each character is then determined from those scores. At least some runs of consecutive characters in the target text have the same part-of-speech classification and form words to be classified; each word to be classified has that part-of-speech classification (because the part-of-speech classification of every character in the word to be classified is the same), and in this way the named entities and their part-of-speech classifications are obtained.
  • The part-of-speech classification is used to indicate the "property" of the corresponding word.
  • The part-of-speech classification of a character indicates the "property" of the word to which the character belongs, so the part-of-speech classification is equivalent to the word's "type", "attribute", "tag", and so on.
  • the specific types of part-of-speech classification can be preset according to needs.
  • part-of-speech classification can also be set according to other content.
  • the part-of-speech classification can be determined according to the linguistic attributes of words.
  • the part-of-speech classification can include "noun”, “verb”, “preposition”, “quantifier”, “adjective” and so on.
  • the part-of-speech classification can also be determined according to the meaning of the word in reality.
  • the part-of-speech classification can include "person name”, “place name”, “organization name”, “scientific concept”, “disease name”, etc.
  • the part-of-speech classification may also be determined according to the technical field to which the word belongs.
  • the part-of-speech classification may include "medical field”, “industrial field”, “construction field”, and so on.
  • A "non-entity" part-of-speech classification can also be set, and all words that do not belong to any entity classification are considered to have the "non-entity" part-of-speech classification.
  • The conditional random field model can include multiple scoring functions, and each scoring function can calculate, for each character in the text, a score for each possible part-of-speech classification. The score can reflect the probability that the character belongs to the corresponding classification (but is not necessarily equal to a probability value), so each scoring function essentially estimates the "possibility" that the character belongs to each part-of-speech classification.
  • The conditional random field model can determine the part-of-speech classification of each character according to the total score of the character corresponding to each classification, and use consecutive characters with the same part-of-speech classification as one word to be classified; the part-of-speech classification of the word to be classified is the classification of any one of its characters (the part-of-speech classifications of all characters in a word to be classified are necessarily the same).
  • Each scoring function may score only some of the characters for some of the part-of-speech classifications, so the score of a character corresponding to a classification may be given jointly by multiple scoring functions. For each part-of-speech classification of each character, the sum of the scores given by all scoring functions is the "total score" of the character corresponding to that classification, and the total score reflects the probability, as judged by the conditional random field model as a whole, that the character belongs to that part-of-speech classification.
  • The determining of the part-of-speech classification of each character includes: determining the largest total score among all the total scores corresponding to each character in the target text, and using the part-of-speech classification corresponding to the maximum total score as the part-of-speech classification of the character.
  • That is, for each character, the part-of-speech classification with the largest total score among the character's multiple part-of-speech classifications can be used as the part-of-speech classification of the character.
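The argmax rule above, together with the grouping of consecutive same-classification characters into words to be classified, can be sketched as follows (a minimal illustration; the scores and tag names are made up):

```python
def classify_and_group(total_scores):
    """total_scores: one dict per character, mapping each part-of-speech
    classification to that character's total score."""
    tags = [max(s, key=s.get) for s in total_scores]  # argmax per character
    words, start = [], 0                              # group equal consecutive tags
    for i in range(1, len(tags) + 1):
        if i == len(tags) or tags[i] != tags[start]:
            words.append((start, i, tags[start]))     # (begin, end, classification)
            start = i
    return tags, words

# hypothetical total scores for three characters, e.g. "suffers-from chicken pox"
scores = [
    {"disease name": 0.2, "non-entity": 1.5},
    {"disease name": 2.0, "non-entity": 0.1},
    {"disease name": 1.8, "non-entity": 0.3},
]
tags, words = classify_and_group(scores)
print(tags)   # -> ['non-entity', 'disease name', 'disease name']
print(words)  # -> [(0, 1, 'non-entity'), (1, 3, 'disease name')]
```

The last two characters share the "disease name" classification, so they form one word to be classified, exactly as the text describes.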
  • For example, suppose the conditional random field model has 3 scoring functions (2 template functions and 1 sememe function, which will be explained in detail later), and the part-of-speech classifications include "disease name" and "non-entity".
  • The scores are given according to the conditional random field model, and the process of determining the words to be classified and their part-of-speech classifications according to the scores can be as follows:
  • The scoring functions are divided into template functions and a sememe function.
  • The template function is a scoring function used in the conditional random field models of some related technologies; each template function is used to give each character in the target text a score for each part-of-speech classification.
  • Each template function has a feature template and a plurality of feature functions, and each feature template specifies an extraction position having a certain positional relationship with the current position; the feature templates of any two template functions are different.
  • Each template function places each character in the target text at the current position in turn; when any character is at the current position, each feature function judges whether the character at the extraction position matches the preset text it specifies, and if so gives the character at the current position a preset score corresponding to a preset part-of-speech classification.
  • That is, each template function corresponds to a feature template; each feature template specifies an "extraction position" that satisfies a certain positional relationship with the "current position", and different template functions have different extraction position settings.
  • Each template function sequentially uses each character in the target text as the "current position", i.e., uses the current position of the feature template to "traverse" each character in the target text.
  • For each current position, the character located at the "extraction position" in the target text can be found according to the feature template.
  • the traversal process for the text "children are more likely to suffer from chickenpox” can be as follows:
  • The part-of-speech classification of each character in a text has a certain "probabilistic" relationship with the characters before and after it.
  • For example, the character immediately after the word "suffering from" usually (but not absolutely) begins the name of a disease.
  • Therefore, a character that follows the word "suffering from" has a higher probability of having the "disease name" part-of-speech classification.
  • By exploiting such relationships, the template function can "predict" the probabilities of the various possible output sequences, that is, predict the sequence formed by the part-of-speech classifications of the characters.
  • each template function includes multiple characteristic functions.
  • The feature function includes a "preset text", a "preset score", and a "preset part-of-speech classification", all of which are obtained through training.
  • Each feature function judges whether the character at the "extraction position" matches its own "preset text" (i.e., the text is identical); if it matches, it gives the current character the "preset score" for its "preset part-of-speech classification".
  • The total score given by a template function for each part-of-speech classification of each character is the sum of the preset scores given by all of its feature functions for their respective preset part-of-speech classifications.
  • the corresponding feature functions may include:
  • Feature function 1: the preset text is "suffering from" and the preset part-of-speech classification is "disease name"; it gives the current character a certain score for the "disease name" classification when the two characters before the current position are "suffering from", because "suffering from (the name of a certain disease)" is a common description.
  • Feature function 2: the preset text is "group as" and the preset part-of-speech classification is "diseased population"; it gives the current character a certain score for the "diseased population" classification when the two characters before the current position are "group as".
  • The above template function may also include feature function 3, which gives the current character a certain score for the "non-entity" classification when the two characters before the current position are "suffering from".
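The behavior of a single template function (check the character at the extraction position against each feature function's preset text and, on a match, add that feature function's preset score for its preset classification) can be sketched as follows. The texts, tags, and weights here are hypothetical, and the sketch uses a one-character extraction position for brevity:

```python
def template_score(text, i, offset, feature_functions):
    """Score contributions of one template function whose extraction
    position is (current position i) + offset.
    feature_functions: list of (preset_text, preset_pos, weight) triples."""
    j = i + offset
    if not 0 <= j < len(text):
        return {}  # extraction position falls outside the text
    scores = {}
    for preset_text, preset_pos, weight in feature_functions:
        original = 1 if text[j] == preset_text else 0  # original value: 0 or 1
        if original:
            # preset score = original value * trained weight (may be negative)
            scores[preset_pos] = scores.get(preset_pos, 0.0) + weight
    return scores

# hypothetical feature functions for a C-1 template (extraction = previous char)
ffs = [("x", "disease name", 1.2), ("x", "non-entity", -0.4)]
print(template_score("xyz", 1, -1, ffs))  # -> {'disease name': 1.2, 'non-entity': -0.4}
print(template_score("xyz", 0, -1, ffs))  # -> {}
```

Note that one matching preset text can contribute scores to several classifications at once, and a negative weight lowers a classification's total score, matching the description of weights below.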
  • Each feature function by itself only gives an original value of 0 (no match) or 1 (match) according to whether the character at the extraction position matches the preset text; in training, different feature functions obtain different weights (described in detail later), so that the preset score of a feature function is the product of its original value and its weight.
  • The above weight can be a "negative number", which indicates that when the extraction position contains the preset text, the probability that the current character has the preset part-of-speech classification is reduced; that is, the preset text indicates that the current character is "unlikely" to have the preset classification.
  • When there is no match, the feature function can be considered to give no score for the preset part-of-speech classification, or equivalently a score of 0, so it does not affect the actual part-of-speech classification.
  • The weight of a feature function actually reflects the strength of the association between its preset text and its preset part-of-speech classification, that is, the probability that the current character has the preset classification when the extracted character equals the preset text.
  • In other words, each template function of the conditional random field model is equivalent to a conditional probability distribution model P(Y|X).
  • The conditional probability distribution model means the Markov random field of a set of output random variables Y given a set of input random variables X; the characteristic of the template function is to assume that the output random variables constitute a Markov random field. The specific formula of the template function can therefore be as follows:
  • P(I|O) = (1/Z(O)) * exp( Σ_i Σ_k w_k * f_k(O, I_{i-1}, I_i, i) )
  • where i represents the index of the current position, k represents the index of the feature function, f_k represents the original value (0 or 1) given by feature function k, w_k represents the weight of feature function k, Z(O) is used to achieve normalization so that the result is a probability value between 0 and 1 (normalization is optional), and P(I|O) represents the probability (score) of the hidden state sequence I obtained under the condition of a given observation sequence O.
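This formula can be evaluated directly for a toy example. The sketch below computes the unnormalized score of a tag sequence and normalizes over a small explicit candidate set; a real implementation would compute Z(O) with dynamic programming, and the feature function and weight here are invented for illustration:

```python
import math

def sequence_score(weights, features, O, I):
    """Unnormalized log-score: sum over positions i and feature functions k
    of w_k * f_k(O, I[i-1], I[i], i)."""
    return sum(w * f(O, I[i - 1] if i > 0 else None, I[i], i)
               for i in range(len(I))
               for w, f in zip(weights, features))

def crf_probability(weights, features, O, candidates, I):
    """P(I|O) = exp(score(I)) / Z(O), with Z(O) summed over an explicit
    candidate set here (a real CRF sums over all sequences efficiently)."""
    Z = sum(math.exp(sequence_score(weights, features, O, J)) for J in candidates)
    return math.exp(sequence_score(weights, features, O, I)) / Z

# invented feature function: fires when tag "B" is followed by tag "I"
def f1(O, prev, cur, i):
    return 1 if prev == "B" and cur == "I" else 0

candidates = [("B", "I"), ("B", "O"), ("O", "O"), ("O", "I")]
p = crf_probability([2.0], [f1], "xx", candidates, ("B", "I"))
print(round(p, 4))  # -> 0.7112
```

Only the sequence ("B", "I") fires the feature, so its probability is exp(2) / (exp(2) + 3), and the normalization keeps the result between 0 and 1 as the formula requires.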
  • In some embodiments, the extraction position specified by each feature template is any one of the following:
  • C-2, C-1, C0, C1, C2, C-2C-1, C-1C0, C0C1, C1C2, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3;
  • where Cn represents the position n characters after the current position,
  • C-n represents the position n characters before the current position,
  • and n is any one of 0, 1, 2, and 3 (C0 indicates that the extraction position is the current position itself).
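The extraction positions can be read off a template name mechanically. A minimal sketch follows, using the template names from the list above; a template whose positions fall outside the text yields no extraction:

```python
# offsets relative to the current position for each feature template
OFFSETS = {
    "C-2": (-2,), "C-1": (-1,), "C0": (0,), "C1": (1,), "C2": (2,),
    "C-2C-1": (-2, -1), "C-1C0": (-1, 0), "C0C1": (0, 1), "C1C2": (1, 2),
    "C-3C-2C-1": (-3, -2, -1), "C-2C-1C0": (-2, -1, 0),
    "C-1C0C1": (-1, 0, 1), "C0C1C2": (0, 1, 2), "C1C2C3": (1, 2, 3),
}

def extract(text, i, template):
    """Characters at the template's extraction positions relative to current
    position i, or None if any position falls outside the text."""
    chars = []
    for off in OFFSETS[template]:
        j = i + off
        if not 0 <= j < len(text):
            return None
        chars.append(text[j])
    return "".join(chars)

print(extract("abcde", 2, "C-1C0C1"))  # -> bcd
print(extract("abcde", 0, "C-2"))      # -> None
```

The extracted string is what each feature function then compares against its preset text.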
  • the characteristic template of the template function in the conditional random field model of the embodiment of the present disclosure may include one or more of the above.
  • C-2, C-1, C0, C1, C2 are unary templates (Unigram), that is, their extraction position is a single character.
  • C-2C-1, C-1C0, C0C1, C1C2 are binary templates (Bigram), that is, their extraction positions are two characters.
  • C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3 are ternary templates (Trigram), that is, their extraction positions are three characters.
  • the "extracted bit” may include the “current bit”, or the “extracted bit” may also be the "current bit”.
  • The characters at the extraction positions may be the same for the feature templates of different template functions.
  • For example, "diabetes" (糖尿病) is a common disease name, so a template function whose feature template is C0 can be used: when the current position (which is also the extraction position) is the character "sugar" (糖, the first character of "diabetes"), the character may be given a higher score for the "disease name" part-of-speech classification.
  • In addition to the template functions, the scoring functions of the conditional random field model also include a sememe function.
  • Unlike the template function, the sememe function does not use a feature template and feature functions; instead, based on how a word (such as a word to be matched) in the target text matches the sememes in the preset sememe library, it gives each character in the word a score for a specific part-of-speech classification.
  • For words without a matching sememe, the sememe function gives no score (which can also be regarded as giving a score of 0).
  • the "sense source library” is a pre-summarized, including a large number of known semantic source databases (such as the semantic source library of a specific website), such as the Hownet knowledge base.
  • the semantic source library includes a large number of semantic sources, as well as additional information related to each semantic source.
  • the additional information may include the source code (ID), the translation of the semantic source (such as English translation), the type attribute (Father information), etc. . Taking the meaning of "twitch" as an example, the additional information may include: Father information: Ill (No. 105), ID: 113, English: twitch.
  • the above additional information includes the "type attribute (Father information) of Yiyuan, that is, the "type”, “attribute”, “label”, etc., predetermined for Yiyuan in the Yiyuan library.
  • The type attribute may have a certain correspondence with the "part-of-speech classification", and the correspondence may be manually set in advance.
  • The type attributes and part-of-speech classifications can correspond one-to-one (of course, the name of a type attribute need not be identical to the name of the corresponding part-of-speech classification); alternatively, multiple type attributes can correspond to the same part-of-speech classification, or each type attribute can correspond to multiple part-of-speech classifications, etc., which will not be described in detail here.
  • For example, the type attributes may include "disease name"; or the type attributes may include "morbidity", with the "morbidity" type attribute preset to correspond to the "common symptoms" part-of-speech classification.
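A hand-set correspondence from type attributes to part-of-speech classifications can be as simple as a lookup table. In this hypothetical sketch, the "twitch" record follows the example above, while the "diabetes" record and both attribute-to-classification mappings are invented for illustration:

```python
# hypothetical sememe records (HowNet-style "father" type attribute) and a
# hand-set mapping from type attribute to part-of-speech classification
SEMEME_DB = {
    "twitch":   {"id": 113, "father": "Ill"},      # per the example above
    "diabetes": {"id": 999, "father": "disease"},  # invented entry
}
ATTR_TO_POS = {
    "Ill": "common symptoms",
    "disease": "disease name",
}

def pos_for_word(word):
    """Part-of-speech classification of a word via its matching sememe,
    or None when there is no matching sememe or no mapped type attribute."""
    rec = SEMEME_DB.get(word)
    if rec is None:
        return None
    return ATTR_TO_POS.get(rec["father"])

print(pos_for_word("twitch"))   # -> common symptoms
print(pos_for_word("unknown"))  # -> None
```

Because the mapping is an ordinary dictionary, the one-to-one and many-to-one cases described above are both just different table contents.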
  • the conditional random field model further includes a word segmentation function, used to divide the target text into a plurality of words to be matched; the sememe function matches each word to be matched against the sememes in the preset sememe library and, when a word to be matched has a matching sememe, gives each character of that word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library.
  • that is, the target text can be segmented by the preset word segmentation function into multiple words with relatively independent meanings, i.e. multiple "words to be matched", and these words to be matched are then matched against the sememes.
  • the words to be matched obtained at this stage are not necessarily all final words to be classified.
  • for example, segmentation may yield "diabetes" as a word to be matched, while in the target text that string sits inside "type 1 diabetes"; in the end it may be "type 1 diabetes" that is determined as a "word to be classified".
  • once the words to be matched are determined, the sememe library can be searched for a sememe whose text is exactly the same as (i.e. matches) a word to be matched. If a word to be matched has a matching sememe, the model goes on to check whether that sememe's "type attribute" has a corresponding "part-of-speech classification"; if so, each character of the word to be matched is directly given a score for that classification.
  • for example, if the words to be matched obtained by segmenting the target text include "diabetes" (糖尿病), the sememe library contains the sememe "diabetes", and its type attribute corresponds to the "disease name" classification, then each of the three characters "糖, 尿, 病" in the target text can be given a score of 0.5 for the "disease name" part-of-speech classification.
  • by providing the "sememe function", sememes can be used to uncover the semantic features of related words.
  • for example, suppose that in the training text the classification of "type 2 diabetes" is "disease name", "diabetes" is a sememe, and the classification corresponding to its "type attribute" is also "disease name"; then for "type 1 diabetes" in the target text, the sememe function can give the "diabetes" part a relatively high "disease name" score, and combined with other scoring functions such as the template functions, the conditional random field model is quite likely to judge "type 1 diabetes" in the target text as a "disease name" as well.
  • conditional random field model can also include some other scoring functions.
  • for example, the conditional random field model can include scoring functions judging whether a character is punctuation (Punctuation: IsPunc), a digit (Digits: IsDigits), a letter (Alphabets: IsAlpha), the beginning of a sentence (Position of Char: Bos), the end of a sentence (Position of Char: Eos), a common suffix (Common Suffix), or a common prefix (Common Prefix), etc.
  • these scoring functions can also affect the final processing result. For example, if one of them judges a character to be "punctuation", that character can be given a very high score for a "punctuation" classification, ensuring it will not be taken as part of a named entity. Or, if one of them judges a character to be the "sentence beginning", then when the words to be classified are determined, it is fixed that this character and the preceding character cannot belong to the same word to be classified.
  • the scoring functions above can be designed with the sklearn_crfsuite package of the Python language, and will not be detailed here.
  • S103 Determine a named entity in the target text according to the part-of-speech classification of the word to be classified.
  • that is, according to the part-of-speech classifications of the words to be classified obtained above, it is determined which parts of the target text are named entities, and which part-of-speech classifications correspond to those named entities; this is the recognition result.
  • for example, the recognition result may be a csv (comma-separated values) file in which the target text occupies one column, each row of that column holding one character of the target text, and the part-of-speech classification occupies another column, each row holding the classification corresponding to the character of the target text in the same row.
  • a concrete storage form of the recognition result for the passage "for diabetes" in the target text stores the characters and their tags row by row in this two-column layout.
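The two-column layout just described can be sketched as follows. The tag symbols follow the example given later in the description, where S marks "disease name" and Q marks "other":

```python
import csv
import io

def save_recognition_result(chars, tags):
    """Write the recognition result in the two-column csv form described
    above: one character of the target text per row, paired with its
    part-of-speech tag."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for ch, tag in zip(chars, tags):
        writer.writerow([ch, tag])
    return buf.getvalue()

# "对糖尿病而言" ("for diabetes"), with S = disease name, Q = other
result = save_recognition_result(list("对糖尿病而言"),
                                 ["Q", "S", "S", "S", "Q", "Q"])
```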
  • when the part-of-speech classifications are determined, the match between words and "sememes" and the sememes' type attributes serve as one of the reference factors; sememes are known words with definite meanings, and their type attributes are likewise predetermined. Therefore, first, the positional relationships of sememes can be used to raise the abstraction of keywords so as to reduce the number of feature templates (template functions); second, sememes can be used to uncover the semantic features of related words; third, a word corpus with clear orientation (the sememe library) helps determine the type of the named entity (its part-of-speech classification).
  • thus the embodiments of the present disclosure can raise the abstraction of feature templates and reduce their number; on the one hand this improves the speed of computation and training, and on the other it improves the accuracy of named entity recognition.
  • target texts recognized by the method were checked manually, and the recognition accuracy of the method of the embodiments of the present disclosure was found to reach more than 90%.
  • determining the named entity in the target text according to the part-of-speech classification of the word to be classified includes: extracting words to be classified whose part-of-speech classification belongs to a predetermined field as named entities.
  • in some embodiments, the predetermined field above may specifically be the medical field.
  • obtaining the target text includes: obtaining the target text from at least one of a medical database, a medical website, a medical paper, a medical textbook, and a medical record.
  • the content of the medical field may specifically include medical databases, medical websites, medical papers, medical textbooks, medical records, and so on.
  • when obtaining target text in the medical field, crawling may be limited to the content of "medical encyclopedia knowledge" (as determined by the classification label the website gives an encyclopedia entry), which is then used as the target text of the medical field.
  • the method further includes:
  • that is, after the named entity is determined, the recognition result can also be output for further use.
  • the "output" above may include displaying the recognition result, broadcasting it by voice, printing it out, or transmitting it to other devices by copying/sending, etc., which will not be detailed here.
  • in some embodiments, before the part-of-speech classifications of the words to be classified in the target text are determined according to the preset conditional random field model (S102), the method further includes:
  • S1001: train the conditional random field model with training text.
  • that is, the conditional random field model (CRF) can be obtained through training.
  • specifically, part of the text can be selected first and each character in it manually labeled with its part-of-speech classification; these texts are then used as training text to train the conditional random field model.
  • the field of the training text obtained here should be the same as the field of the named entities to be recognized.
  • for example, texts in the medical field can be selected as training texts from the above medical databases, medical websites, medical papers, medical textbooks, and medical records.
  • the training process of the conditional random field model can proceed as follows: the model is refined step by step until an end condition is met, at which point training ends and a conditional random field model that can be applied directly afterwards is obtained.
  • the end conditions above are diverse.
  • for example, they may include the accuracy of the current model's recognition results reaching a predetermined value (or converging), or the number of training cycles reaching a predetermined value, etc., which will not be detailed here.
  • each template function is generated based on the training text.
  • when any character is at the current position, the character at the extraction position is determined as well.
  • thus multiple feature functions can be generated; each feature function's "preset part-of-speech classification" is one possible classification, while the "preset characters" of all these feature functions are the same (all being the characters at the extraction positions).
  • the training process mainly adjusts the weights of the feature functions above; that is, according to the different "preset part-of-speech classification - preset characters" correspondences and their numbers of occurrences, the weights of different feature functions are increased or decreased,
  • so that when the conditional random field model processes the target text with the adjusted weights, it obtains essentially the same results as the manually labeled part-of-speech classifications.
  • of course, the weight of the sememe function, or the scores it gives, can also be adjusted continuously during the training process above.
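The description does not fix a particular optimizer for this weight adjustment. As a rough illustration only, a perceptron-style rule that raises the weights of feature functions firing for the manually labeled classification and lowers those firing for a wrong prediction could look like this (the dictionary keys are hypothetical "preset characters -> preset classification" identifiers):

```python
def adjust_weights(weights, fired_for_gold, fired_for_wrong, step=0.1):
    """One illustrative adjustment step: increase the weights of feature
    functions that fired for the manually labeled (gold) classification,
    decrease those that fired for the model's wrong prediction. This
    perceptron-style rule is a stand-in, not the patent's stated optimizer."""
    for key in fired_for_gold:
        weights[key] = weights.get(key, 0.0) + step
    for key in fired_for_wrong:
        weights[key] = weights.get(key, 0.0) - step
    return weights

w = adjust_weights({}, ["患有->disease_name"], ["患有->non_entity"])
```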
  • the conditional random field model is a machine learning algorithm; compared with deep learning algorithms it requires far less training, so the training process can be completed with only a small amount of manually annotated text, making it easy to implement and efficient.
  • it is also feasible for the conditional random field model to be a "copy" obtained by duplicating a trained conditional random field model stored on another storage medium (the copy itself not being trained).
  • after the named entities are obtained, these named entities and their part-of-speech classifications can be processed further to determine the relationships between different named entities and build a knowledge graph of the named entities.
  • for example, the named entities obtained above can serve as the "entities" of the knowledge graph, and their part-of-speech classifications as the entities' "attributes", so that extraction determines which entities the knowledge graph contains and what relationships hold between them (such as having the same attribute), yielding the knowledge graph.
  • each knowledge graph may be a knowledge graph of a predetermined field (such as the medical field), i.e. the entities (named entities) and attributes (part-of-speech classifications) in it all relate to that predetermined field.
  • an embodiment of the present disclosure provides a method for establishing a named entity dictionary, including:
  • the above “determination” can be to manually specify some target texts to be processed, or to obtain some target texts to be processed by means of crawling or the like.
  • multiple named entities are determined from multiple target texts, and the part-of-speech classifications of these named entities are determined.
  • each word in the named entity dictionary may have the same part-of-speech classification.
  • for example, the named entity dictionary can store only the named entities (the words) and not their part-of-speech classifications (the words' attributes), with the classification of its entries indicated uniformly, e.g. by the name of the dictionary.
  • the part-of-speech classification of words in each named entity dictionary may also belong to a predetermined field.
  • for example, named entities of the medical field whose part-of-speech classifications are "disease name", "affected population", "diseased location", "cause", "common symptom", "consulting department", "route of transmission", "physiological index", "examination means", "treatment means", and so on, can be added to one named entity dictionary to obtain a "medical dictionary".
  • the part-of-speech classification of words in each named entity dictionary may also belong to multiple different fields, that is, the named entity dictionary may be a “comprehensive dictionary”.
  • the named entity dictionary can store both the named entity (that is, the word) and its part-of-speech classification (that is, the attributes of the word) at the same time.
  • sampling tests of named entity dictionaries built according to the method of the embodiments of the present disclosure show an accuracy rate of more than 90%.
  • thus, compared with approaches that require manual data verification, the method of the embodiments of the present disclosure achieves essentially accurate recognition.
  • an embodiment of the present disclosure provides a named entity recognition device, which includes:
  • an obtaining module, configured to obtain the target text.
  • a classification module, configured to determine, according to a preset conditional random field model, the words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model includes a plurality of scoring functions, the scoring functions including a sememe function and at least one template function; each template function gives, for each character in the target text, a score for each part-of-speech classification; the sememe function matches at least some of the words in the target text against the sememes in a preset sememe library and, when a word has a matching sememe, gives each character of that word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library.
  • the conditional random field model determines each character's part-of-speech classification from the character's total score for each classification, a character's total score for any classification being the sum of the scores all scoring functions give that character for that classification; the model further determines that multiple consecutive characters with the same part-of-speech classification form one word to be classified.
  • the determining module is configured to determine the named entity in the target text according to the part-of-speech classification of the word to be classified.
  • the device for recognizing a named entity in the embodiment of the present disclosure can implement the above method for recognizing a named entity.
  • the determining module is configured to extract words to be classified in a part-of-speech classification of a predetermined field as named entities.
  • the determining module of the device for recognizing named entities in the embodiments of the present disclosure may specifically also extract named entities in a predetermined field (such as the medical field).
  • an electronic device which includes:
  • one or more processors;
  • a memory having one or more programs stored thereon;
  • when the one or more programs are executed by the one or more processors, the one or more processors implement any one of the named entity recognition methods above.
  • the electronic device of the embodiment of the present disclosure can implement the above named entity recognition method.
  • the electronic device of the embodiment of the present disclosure further includes one or more I/O interfaces, which are connected between the processor and the memory, and are configured to implement information interaction between the processor and the memory.
  • an I/O interface can also be set to realize the data interaction between the processor and the memory.
  • the processor is a device with data processing capabilities, including but not limited to a central processing unit (CPU), etc.
  • the memory is a device with data storage capabilities, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH);
  • the I/O interface (read/write interface) is connected between the processor and the memory, and the information exchange of the processor is achieved through, but not limited to, a data bus (Bus), etc.
  • an embodiment of the present disclosure provides a computer readable medium on which a computer program is stored, and when the program is executed by a processor, any one of the aforementioned named entity recognition methods is implemented.
  • that is, the embodiments of the present disclosure implement the named entity recognition above when the program on the computer-readable medium is executed by a processor.
  • the division between functional modules/units mentioned in the description above does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed cooperatively by several physical components.
  • some or all physical components may be implemented as software executed by a processor such as a central processing unit (CPU), digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit.
  • Such software may be distributed on a computer-readable medium, and the computer-readable medium may include a computer storage medium (or a non-transitory medium) and a communication medium (or a transitory medium).
  • the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
  • computer storage media include, but are not limited to, random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory (FLASH) or other disk storage; CD-ROM, digital versatile disk (DVD) or other optical storage; magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage; or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • a communication medium usually contains computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. .


Abstract

The method for named entity recognition includes: obtaining a target text (S101); determining, according to a preset conditional random field model, words to be classified in the target text and their part-of-speech classifications (S102); the conditional random field model includes a sememe function and template functions; each template function gives, for each character in the target text, a score for each part-of-speech classification; the sememe function matches at least some words in the target text against sememes in a preset sememe library and, when a word has a matching sememe, gives each character of the word a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe library; the conditional random field model determines each character's part-of-speech classification from the character's total score for each classification; the conditional random field model further determines that multiple characters with the same part-of-speech classification form one word to be classified; and a named entity in the target text is determined according to the part-of-speech classification of the word to be classified (S103).

Description

Method and Device for Entity Recognition, Method for Building a Dictionary, Apparatus, and Medium — Technical Field
The embodiments of the present disclosure relate to the field of computer technology, and in particular to a method and device for named entity recognition, a method for building a named entity dictionary, an electronic apparatus, and a computer-readable medium.
Background Art
With the development of technology, the amount of network data has grown explosively, and network data is large-scale, heterogeneous, and loosely organized; how to extract the required knowledge from massive network data has therefore become an urgent problem.
Summary of the Invention
The embodiments of the present disclosure provide a method and device for named entity recognition, a method for building a named entity dictionary, an electronic apparatus, and a computer-readable medium.
In a first aspect, an embodiment of the present disclosure provides a method for named entity recognition, including:
obtaining a target text;
determining, according to a preset conditional random field model, words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model includes a plurality of scoring functions, the scoring functions including a sememe function and at least one template function; each template function is used to give, for each character in the target text, a score for each part-of-speech classification; the sememe function is used to match at least some of the words in the target text against sememes in a preset sememe library and, when a word of the at least some words has a matching sememe, to give each character of that word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library; the conditional random field model is used to determine each character's part-of-speech classification according to the character's total score for each classification, the total score of any character for any classification being the sum of the scores given by all scoring functions for that character and that classification; the conditional random field model is further used to determine that a plurality of characters with the same part-of-speech classification form one word to be classified, and to determine that the part-of-speech classification of each word to be classified is the classification of any character in it;
determining a named entity in the target text according to the part-of-speech classification of the word to be classified.
In some embodiments, each template function has a feature template and a plurality of feature functions; each feature template specifies extraction positions having a definite positional relationship with the current position, and the feature templates of any two template functions are different;
each template function is used to place each character of the target text at the current position in turn; when any character is at the current position, each feature function is used to judge whether the characters at the extraction positions match the preset characters the feature function itself specifies, and to give the character at the current position a preset score for one preset part-of-speech classification.
In some embodiments, the extraction positions specified by each feature template are any one of the following:
C-2, C-1, C0, C1, C2, C-2C-1, C-1C0, C0C1, C1C2, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3;
where Cn denotes the position of the n-th character after the current position, C-n denotes the position of the n-th character before the current position, and n is any one of 0, 1, 2, 3.
In some embodiments, the conditional random field model further includes a scoring function judging whether a character is punctuation, one judging whether it is a digit, one judging whether it is a letter, one judging whether it is the beginning of a sentence, one judging whether it is the end of a sentence, one judging whether it is a common suffix, and one judging whether it is a common prefix.
In some embodiments, determining each character's part-of-speech classification according to the total score of each character in the target text for each classification includes:
determining the maximum total score among all total scores corresponding to each character in the target text;
taking the part-of-speech classification corresponding to the maximum total score as the character's part-of-speech classification.
In some embodiments, the conditional random field model further includes a word segmentation function used to divide the target text into a plurality of words to be matched;
the sememe function is used to match each word to be matched against the sememes in the preset sememe library and, when a word to be matched has a matching sememe, to give each character of that word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library.
In some embodiments, determining the named entity in the target text according to the part-of-speech classification of the word to be classified includes:
extracting words to be classified whose part-of-speech classification belongs to a predetermined field as named entities.
In some embodiments, the predetermined field is the medical field.
In some embodiments, obtaining the target text includes:
obtaining the target text from at least one of a medical database, a medical website, medical papers, medical textbooks, and medical records.
In some embodiments, after the named entity in the target text is determined according to the part-of-speech classification of the word to be classified, the method further includes:
outputting the named entity.
In some embodiments, before the words to be classified in the target text and their part-of-speech classifications are determined according to the preset conditional random field model, the method further includes:
obtaining training text, each character in the training text having a preset part-of-speech classification;
training the conditional random field model with the training text.
In a second aspect, an embodiment of the present disclosure provides a method for building a named entity dictionary, including:
determining a plurality of target texts;
determining a plurality of named entities in the plurality of target texts according to any one of the named entity recognition methods above;
building a named entity dictionary from the plurality of named entities.
In a third aspect, an embodiment of the present disclosure provides a device for named entity recognition, which includes:
an obtaining module, configured to obtain a target text;
a classification module, configured to determine, according to a preset conditional random field model, words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model includes a plurality of scoring functions, the scoring functions including a sememe function and at least one template function; each template function is used to give, for each character in the target text, a score for each part-of-speech classification; the sememe function is used to match at least some of the words in the target text against sememes in a preset sememe library and, when a word of the at least some words has a matching sememe, to give each character of that word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library; the conditional random field model is used to determine each character's part-of-speech classification according to the character's total score for each classification, the total score of any character for any classification being the sum of the scores given by all scoring functions for that character and that classification; the conditional random field model is further used to determine that a plurality of characters with the same part-of-speech classification form one word to be classified, and to determine that the part-of-speech classification of each word to be classified is the classification of any character in it;
a determining module, configured to determine a named entity in the target text according to the part-of-speech classification of the word to be classified.
In some embodiments, the determining module is configured to extract words to be classified whose part-of-speech classification belongs to a predetermined field as named entities.
In a fourth aspect, an embodiment of the present disclosure provides an electronic apparatus, which includes:
one or more processors;
a memory having one or more programs stored thereon which, when executed by the one or more processors, cause the one or more processors to implement any one of the named entity recognition methods above.
In a fifth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored; when the program is executed by a processor, any one of the named entity recognition methods above is implemented.
Brief Description of the Drawings
The accompanying drawings are provided for a further understanding of the embodiments of the present disclosure and constitute a part of the specification; they serve to explain the present disclosure together with its embodiments and do not limit the present disclosure. The above and other features and advantages will become more apparent to those skilled in the art from the detailed description of exemplary embodiments with reference to the drawings, in which:
Fig. 1 is a flowchart of a method for named entity recognition provided by an embodiment of the present disclosure;
Fig. 2 is a flowchart of another method for named entity recognition provided by an embodiment of the present disclosure;
Fig. 3 is a flowchart of a method for building a named entity dictionary provided by an embodiment of the present disclosure;
Fig. 4 is a block diagram of a device for named entity recognition provided by an embodiment of the present disclosure;
Fig. 5 is a block diagram of an electronic apparatus provided by an embodiment of the present disclosure;
Fig. 6 is a block diagram of another electronic apparatus provided by an embodiment of the present disclosure;
Fig. 7 is a block diagram of a computer-readable medium provided by an embodiment of the present disclosure.
Detailed Description
To enable those skilled in the art to better understand the technical solutions of the embodiments of the present disclosure, the method and device for named entity recognition, the method for building a named entity dictionary, the electronic apparatus, and the computer-readable medium provided by the embodiments of the present disclosure are described below with reference to the accompanying drawings.
The embodiments of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings, but the illustrated embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth here. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The embodiments of the present disclosure may be described with reference to plan and/or sectional views by way of idealized schematic illustrations of the present disclosure. Accordingly, the example illustrations may be modified in accordance with manufacturing techniques and/or tolerances.
The embodiments of the present disclosure and the features in the embodiments may be combined with each other in the absence of conflict.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items. As used herein, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprise" and "made of", as used herein, specify the presence of the stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used in the present disclosure have the same meaning as commonly understood by one of ordinary skill in the art. It will further be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined in the present disclosure.
The embodiments of the present disclosure are not limited to those shown in the drawings but include modifications of configurations formed on the basis of manufacturing processes. Accordingly, the regions illustrated in the figures have schematic properties; the shapes of the regions shown illustrate the specific shapes of the regions of elements but are not intended to be limiting.
Explanation of Technical Terms
In the present disclosure, unless otherwise specified, the following technical terms should be understood as follows:
"Text" is a collection of content that can linguistically express a certain meaning, composed mainly of characters (possibly together with punctuation and other symbols); for example, a sentence, an article, a book, or a text web page can each be a "text".
"Character" is the most basic unit constituting a text; specifically it can be a Chinese character, a letter, a digit, a symbol, punctuation, etc.
"Word" refers to a collection of one character or several consecutive characters which can linguistically express a relatively independent meaning.
A "sememe" is, in linguistics, the most basic, smallest semantic unit that can independently express a definite meaning and is not easily subdivided further, i.e. the most fine-grained "word". For example, although "human" is a very complex concept and a collection of many attributes, it can also be regarded as a sememe. In theory, all concepts can be decomposed into various sememes, yielding a finite set of sememes (a sememe library). It should be understood that many sememes are also "named entities"; for example, the relationships between different "words (named entities)" can be understood as a tree structure, and a final branch of the tree that cannot be subdivided further is a sememe. For example, living things include animals; animals include humans and beasts; fish belong to beasts; since "fish" cannot be subdivided further, "fish" is a sememe. Sememes can not only resolve word-sense disambiguation and capture the relationships between words more finely, but also better predict in what manner the next word appears after a given word; relationships at the sememe level carry richer semantics.
The "medical field" refers to the scope formed by all content strongly related to medical technology; more specifically, it may include disease-related content (such as the kinds, causes, symptoms, and treatment of diseases), treatment-related content (such as treatment methods, treatment equipment, and treatment drugs), prevention/exercise-related content (such as prevention/exercise methods, equipment, and drugs), and content related to specific medical concepts (such as doctors, medical organizations, and the history of medical development).
A "knowledge graph" is a collection of data representing the relationships between different entities and the attributes of the entities; in a knowledge graph, entities are nodes, and entities are connected to each other, entities to their corresponding attributes, and attributes to their corresponding values, by edges, forming a structured, network-like database.
In some related technologies, key symptom information, examination means, etc. must be extracted from long natural-language passages describing diseases. Manual extraction costs too much effort and money, and since the amount of data is large and constantly updated, manual extraction is clearly unrealistic. Machine extraction, in turn, requires the corpus to be annotated so that a machine can learn its regularities and recognize the key information; considering labor costs, the amount of annotated data cannot be large, and in that situation deep learning algorithms do not perform well, because they need a large amount of annotated data to master the task.
In a first aspect, an embodiment of the present disclosure provides a method for named entity recognition.
The method of the embodiments of the present disclosure can be used to recognize at least some named entities in a text and determine their part-of-speech classifications.
Referring to Fig. 1, the method for named entity recognition of the embodiment of the present disclosure may specifically include the following steps S101 to S103.
S101: obtain a target text.
That is, obtain the text on which subsequent recognition is to be performed, namely the target text.
The specific manner of "obtaining" is varied, as long as the target text is made available for subsequent processing.
For example, the target text may be obtained through an input device such as a mouse, keyboard, voice input unit, character recognition unit, or scanning unit; or it may be read directly from a storage medium such as a designated hard disk.
In some embodiments, this step (S101) may specifically include: crawling the target text from the network.
That is, crawler software can crawl the target text from the content of a network (e.g. a predetermined range of the network).
For example, specific target texts can be crawled from the encyclopedia section of a website, e.g. the content of encyclopedia entries named after disease names can be crawled as the target text. Clearly, the content of an encyclopedia entry named after a disease is mostly related to that disease, such as the disease's aliases, causes, and treatment methods.
The content of the target text may be "semi-structured"; for example, the target text may be divided into multiple parts (such as paragraphs or sections) under subheadings such as aliases, causes, and treatment methods.
Of course, the specific form of the target text is varied; for example, the target text may also be unstructured data (such as an article without subheadings).
In some embodiments, the target texts obtained in this step (S101) belong to the same predetermined field.
Certain measures can ensure that all obtained target texts are fairly relevant to a specific field.
Specifically, target texts may be restricted to specific sources to ensure that they all belong to the same predetermined field. For example, when the encyclopedia content of a website is crawled as target text, crawling may be limited to the content of "medical encyclopedia knowledge" (as determined by the classification label the website gives an encyclopedia entry), which serves as the target text of the medical field.
Of course, in the embodiments of the present disclosure, the specific data-processing manners of obtaining and saving the target text are varied.
For example, when the target text is obtained by crawling, web pages can be fetched with the urllib functionality of the Python language, and the Beautiful Soup framework can be used to extract the corresponding text content from the pages (e.g. in HTML or XML format) and save it (e.g. saving the content of each page as one file) as the target text.
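The description names urllib plus the Beautiful Soup framework for this step. As a dependency-free sketch of the same "strip the markup, keep the text" operation, the standard library's html.parser suffices; the TextExtractor class below plays the role Beautiful Soup's get_text() would play:

```python
from html.parser import HTMLParser
from io import StringIO

class TextExtractor(HTMLParser):
    """Collect the visible text of a fetched page; a stdlib stand-in for
    the Beautiful Soup step described above."""
    def __init__(self):
        super().__init__()
        self._out = StringIO()

    def handle_data(self, data):
        self._out.write(data)

    def text(self):
        return self._out.getvalue()

def page_text(html):
    """Return the text content of one crawled page."""
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

In the crawling pipeline described above, the returned string would be saved to one file per page.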
S102: determine, according to a preset conditional random field model, the words to be classified in the target text and their part-of-speech classifications.
Here, the conditional random field model includes a plurality of scoring functions, the scoring functions including a sememe function and at least one template function; each template function gives, for each character in the target text, a score for each part-of-speech classification; the sememe function matches at least some of the words in the target text against the sememes in a preset sememe library and, when a word has a matching sememe, gives each character of the word a score for the part-of-speech classification corresponding to the sememe's type attribute in the sememe library; the conditional random field model determines each character's part-of-speech classification according to the character's total score for each classification, the total score of any character for any classification being the sum of the scores given by all scoring functions for that character and that classification; the conditional random field model further determines that a plurality of characters with the same part-of-speech classification form one word to be classified, and determines that the classification of each word to be classified is the classification of any character in it.
In this step (S102), the target text is processed with a preset conditional random field (CRF) model to determine each character's score for each part-of-speech classification (which can reflect the probability that the character belongs to each classification), and each character's classification is then determined from these scores; at least some consecutive characters in the target text have the same part-of-speech classification and form a word to be classified, and the word to be classified has the classification of any character in it (since all its characters have the same classification), thereby yielding the named entities and their part-of-speech classifications.
A part-of-speech classification expresses the "property" of the corresponding word, and a character's classification expresses the property of the word the character belongs to; the part-of-speech classification is thus equivalent to the word's "type", "attribute", "label", etc. The specific kinds of part-of-speech classifications can be preset as needed.
For example, to recognize named entities of the medical field, classifications such as "disease name", "affected population", "diseased location", "cause", "common symptom", "consulting department", "route of transmission", "physiological index", "examination means", and "treatment means" can be set. For example, the part-of-speech classification of the word "gastritis" is "disease name".
When named entities of other fields are to be recognized, specific classifications related to those fields can be set correspondingly.
Of course, depending on the need, the part-of-speech classifications can also be set according to other content.
For example, they can be determined by the linguistic properties of words, such as "noun", "verb", "preposition", "classifier", "adjective", etc.
Or they can be determined by the real-world meaning of words, such as "person name", "place name", "organization name", "scientific concept", "disease name", etc.
Or they can be determined by the technical field a word belongs to, such as "medical field", "industrial field", "construction field", etc.
It should be understood that content not belonging to the named entities to be recognized can also be given detailed classifications (such as adverb, preposition, etc.); but to reduce the complexity of the conditional random field (CRF) model and save computation, all other content not belonging to the named entities to be recognized can be grouped into a "non-entity" classification.
For example, when recognizing named entities of the medical field, besides classifications such as "disease name" and "affected population" above, a "non-entity" classification can be set, and all other words are regarded as having the "non-entity" classification.
A "conditional random field (CRF) model" is a discriminative probabilistic model used to label or analyze sequence data, such as labeling natural-language text (the target text).
Specifically, the conditional random field model may include a plurality of scoring functions; each scoring function can compute, for each character of the text, a score for its belonging to each "possible type (part-of-speech classification)". The score can reflect the probability that the character belongs to the corresponding classification (though it need not equal the probability value), so each scoring function essentially determines the "likelihood" that a character belongs to each classification.
The conditional random field model can then determine each character's classification from the character's total score for each classification, and take consecutive characters with the same classification as one word to be classified, determining that the classification of the word is the classification of any one of its characters (the classifications of all characters in a word to be classified are necessarily the same).
The "total score" for each classification means the following: every scoring function can determine scores for at least some characters belonging to at least some classifications, so a character's score for a classification may be given jointly by several scoring functions; hence, for each character and each classification, the "sum" of the scores given by all the scoring functions is the character's "total score" for the classification, and this total score reflects the overall probability, as judged by the conditional random field model, that the character belongs to the classification.
In some embodiments, determining each character's part-of-speech classification from the total score of each character for each classification includes: determining the maximum among all total scores corresponding to each character of the target text, and taking the classification with the maximum total score as the character's classification.
As one feasible way, among the total scores of a character for its several classifications, the classification corresponding to the largest total score can be taken as the character's classification.
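The summing and maximum-taking just described can be sketched directly; the classification names and score values below are invented for illustration:

```python
def total_scores(per_function_scores):
    """Sum, per part-of-speech classification, the scores that every
    scoring function gave one character."""
    totals = {}
    for scores in per_function_scores:
        for pos, s in scores.items():
            totals[pos] = totals.get(pos, 0.0) + s
    return totals

def choose_classification(totals):
    """Take the classification with the maximum total score."""
    return max(totals, key=totals.get)

# e.g. scores given to one character by two template functions and the
# sememe function (values invented)
totals = total_scores([
    {"disease_name": 0.6, "non_entity": 0.2},
    {"disease_name": 0.4},
    {"disease_name": 0.5},
])
```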
For example, for the text "儿童更容易患水痘 (children are more likely to catch chickenpox)", suppose the conditional random field model has three scoring functions in total (two template functions and one sememe function, described in detail later), and there are three part-of-speech classifications, "disease name", "affected population", and "non-entity"; then the process of the conditional random field model giving scores and determining the words to be classified and their classifications from those scores can be as in the following table:
[Table shown as images in the original: the scores each of the three scoring functions gives every character for every classification, the per-classification totals, and the words to be classified determined from them.]
It should be understood that the specific scoring functions, score values, and classifications in the table above are only schematic and do not limit the actual implementation of the embodiments of the present disclosure. In particular, actual scores are usually not such simple integer values, so the table only schematically shows how words to be classified and their classifications are determined from scores.
In the conditional random field model of the embodiments of the present disclosure, the scoring functions are divided into template functions and a sememe function.
Template functions are the scoring functions used in the conditional random field models of some related technologies; each template function gives, for each character of the target text, a score for its belonging to each classification.
In some embodiments, each template function has a feature template and a plurality of feature functions; each feature template specifies extraction positions having a definite positional relationship with the current position, and the feature templates of any two template functions are different. Each template function places each character of the target text at the current position in turn; when any character is at the current position, each feature function judges whether the characters at the extraction positions match the preset characters it specifies, and gives the character at the current position a preset score for one preset part-of-speech classification.
That is, template functions correspond to feature templates; each feature template specifies "extraction positions" satisfying a certain positional relationship with the "current position", and different template functions set their extraction positions differently.
Thus each template function takes each character of the target text as the "current position" in turn; in other words, the current position of the feature template "traverses" every character of the target text.
When a character of the target text is at the "current position", the characters of the target text at the "extraction positions" can be found according to the feature template.
Specifically, suppose the feature template of a template function specifies the extraction position "one character before the current position (C-1)"; then for the text "儿童更容易患水痘", the traversal can proceed as follows:
when the current position is "童", the extraction position holds "儿";
when the current position is "更", the extraction position holds "童";
when the current position is "容", the extraction position holds "更";
…and so on.
Clearly, the classification of each character in a text has a certain "probabilistic" relationship with the characters before and after it. For example, the first character after the two characters "患有 (suffering from)" is usually (but not absolutely) the first character of a disease name; that is, the character after "患有" has a relatively high probability of belonging to the "disease name" classification.
Therefore, during the "traversal" above, as the current position corresponds to different characters (e.g. "童", "更", "容" in turn), the characters at the extraction position form an "input sequence" (e.g. "儿-童-更" above), and from this input sequence the template function can "predict" the probabilities of the various possible output sequences, i.e. predict the sequence formed by the classifications of the characters at the current position.
More specifically, each template function includes a plurality of feature functions.
A feature function includes "preset characters", a "preset score", and a "preset part-of-speech classification", all of which are obtained by training.
When any character is at the current position, each feature function judges whether the characters at the extraction positions match its own preset characters (i.e. are the same characters); if they match, it gives the character at the current position the preset score for its preset classification. Clearly, the total score a template function gives each character for each classification is the sum of the preset scores all its feature functions give for their respective preset classifications.
For example, for a feature template whose extraction positions are the first and second characters before the current position, the corresponding feature functions may include:
Feature function 1: preset characters "患有 (suffering from)", preset classification "disease name". When the two characters before the current position are "患有", it gives the character at the current position a certain score for the "disease name" classification, since "suffering from (the name of some disease)" is a common description.
Feature function 2: preset characters "群为", preset classification "affected population". When the two characters before the current position are "群为", it gives the character at the current position a certain score for the "affected population" classification, since "the main affected population of this disease is (some group of people)" is a common description.
It should be understood that the relationships above between characters and classifications are only "probabilistic", not inevitable; for example, "患有多种疾病 (suffering from multiple diseases)" is also a common description, i.e. the first character after "患有" may also be "non-entity".
Therefore, the template function above may further include feature function 3, which, when the two characters before the current position are "患有", gives the character at the current position a certain score for the "non-entity" classification.
Specifically, each feature function itself may merely give a raw probability of 0 (no match) or 1 (match) according to whether the characters at the extraction positions match the preset characters, while in training different feature functions acquire different weights (described in detail later), so that the preset score of a feature function is the product of its raw probability and its weight.
The weight above can be "negative", expressing that when the extraction positions hold the preset characters, the probability that the current character has the preset classification is negative; i.e. the preset characters indicate that the current character "cannot" have the preset classification.
Of course, it should be understood that since the raw probability of an unmatched feature function is 0, in the unmatched case the feature function can either be regarded as scoring the preset classification with a score of 0 or as not scoring it at all, without affecting the actual score of the preset classification.
In short, the weight of a feature function actually reflects the strength of the "probabilistic" link between its preset characters and its preset classification, i.e. the probability that the current character has the preset classification when the extraction positions hold the preset characters.
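A feature function as just described, with its trained weight folded in, can be sketched as a small closure; the preset characters, classifications, and weight values below are invented examples:

```python
def make_feature_function(preset_chars, preset_pos, weight):
    """Return a feature function: raw probability 1 when the characters at
    the extraction positions equal its preset characters, else 0; the
    contributed preset score is raw * weight, credited to the preset
    part-of-speech classification."""
    def feature(extracted_chars):
        raw = 1 if extracted_chars == preset_chars else 0
        return preset_pos, raw * weight
    return feature

# weight values are hypothetical training outcomes
f1 = make_feature_function("患有", "disease_name", 0.8)
f3 = make_feature_function("患有", "non_entity", 0.3)
```

Summing such contributions over all feature functions of a template function gives the template function's score for each classification of the current character.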
It can thus be seen that each template function of the conditional random field model corresponds to a conditional probability distribution model P(Y|X), which represents the Markov random field of a set of output random variables Y given a set of input random variables X. That is, the template function assumes that the output random variables form a Markov random field; the specific formula of the template function can therefore be as follows:
P(I|O) = (1 / Z(O)) · exp( Σ_i Σ_k λ_k · f_k(O, I, i) )
where i is the index of the current position; k is the index of the feature function; f is the raw probability (0 or 1) given by the feature function; λ is the weight of the feature function; Z(O) performs normalization so that the result is a probability value between 0 and 1 (it is optional); and P(I|O) denotes the probability (score) of the hidden-state sequence I given the observation sequence O.
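As a numerical illustration of the formula (the weights and firing patterns are invented), the unnormalized score exp(Σ λ_k·f_k) and the normalization by Z(O) over candidate labelings can be computed as:

```python
import math

def unnormalized_score(weights, raw_values):
    """exp(sum_k λ_k · f_k) for one candidate hidden-state sequence,
    where raw_values are the 0/1 firings of the feature functions."""
    return math.exp(sum(w * f for w, f in zip(weights, raw_values)))

def normalize(scores):
    """Divide by Z(O), the sum over all candidate sequences."""
    z = sum(scores)
    return [s / z for s in scores]

# two candidate labelings; each feature function either fires (1) or not (0)
weights = [0.8, -0.2]
probs = normalize([unnormalized_score(weights, fired)
                   for fired in ([1, 0], [0, 1])])
```

The candidate whose fired feature functions carry the larger weights ends up with the larger probability.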
In some embodiments, the extraction positions specified by each feature template are any one of the following:
C-2, C-1, C0, C1, C2, C-2C-1, C-1C0, C0C1, C1C2, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3;
where Cn denotes the position of the n-th character after the current position, C-n denotes the position of the n-th character before the current position, and n is any one of 0, 1, 2, 3 (C0 means the extraction position is the current position itself).
That is, the feature templates of the template functions in the conditional random field model of the embodiments of the present disclosure may include one or more of the above.
Among them, C-2, C-1, C0, C1, C2 are unigram templates, whose "extraction position" is, as above, only one character.
C-2C-1, C-1C0, C0C1, C1C2 are bigram templates, whose "extraction positions" are two characters. For example, for the text "儿童更容易患水痘", when the current position is "更", the extraction positions of the C-1C0 feature template hold "童更", and those of the C1C2 feature template hold "容易".
Similarly, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3 are trigram templates, whose "extraction positions" are three characters. For example, for the text "儿童更容易患水痘", when the current position is "更", the extraction positions of the C-2C-1C0 feature template hold "儿童更", and those of the C-1C0C1 feature template hold "童更容".
From the description above, the "extraction positions" may include the "current position", or the "extraction position" may simply be the "current position".
Meanwhile, for different characters at the current position, the characters obtained at the extraction positions of different feature templates may coincide.
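Reading off the characters at a template's extraction positions can be sketched as follows; the offset tuples encode the Cn notation, and the example strings are the ones used above:

```python
def extracted_chars(text, i, offsets):
    """Return the characters at the extraction positions for current
    position i, where offsets encodes the feature template, e.g. (-1,)
    for C-1, (1, 2) for C1C2, (-2, -1, 0) for C-2C-1C0; returns None
    when a position falls outside the text."""
    chars = []
    for off in offsets:
        j = i + off
        if not 0 <= j < len(text):
            return None
        chars.append(text[j])
    return "".join(chars)

text = "儿童更容易患水痘"   # "children are more likely to catch chickenpox"
i = text.index("更")       # current position at "更"
```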
By setting reasonable feature templates, the relationships within the text can be analyzed from various angles, so that the multiple feature templates (feature functions) jointly give the most reasonable score for each classification of each character.
For example, "糖尿病 (diabetes)" is a common disease name, so a template function using the C0 feature template may give the character "糖", when it is at the current position (and extraction position), a relatively high score for the "disease name" classification.
But the target text may also contain text like "我爱吃糖 (I love eating sugar)", where "糖" is not a disease name; using only the template function corresponding to the C0 feature template might mistake this "糖" for part of a "disease name".
With the multiple feature templates above, however, since the characters after "糖" in "我爱吃糖" are not "尿病", the template functions of feature templates such as C1, C2, C0C1, and C1C2 generally will not give this "糖" a high "disease name" score.
This ensures that the "糖" in "糖尿病" ultimately receives a high total score for the "disease name" classification and is judged as "disease name", while the "糖" in "我爱吃糖" does not, because only individual feature templates would give it a higher "disease name" score, and the total "disease name" score will not be high.
In the embodiments of the present disclosure, the scoring functions of the conditional random field model further include a sememe function. The sememe function does not use feature templates and feature functions; instead, according to how words in the target text (e.g. words to be matched) match the sememes in the preset sememe library, it gives the characters of a word scores for specific part-of-speech classifications. For words with no matching sememe, the sememe function gives no score (which can also be regarded as giving a score of 0).
The "sememe library" is a pre-compiled database containing a large number of known sememes (such as the sememe library of a specific website), for example the HowNet knowledge base.
The sememe library contains a large number of sememes and additional information about each sememe; for example, the additional information may include the sememe's number (ID), its translation (such as an English translation), and its type attribute (Father information). Taking the sememe "抽搐 (twitch)" as an example, its additional information may include: Father information: morbid state (No. 105); ID: 113; English: twitch.
The additional information above includes the sememe's "type attribute" (Father information), i.e. the "type", "attribute", "label", etc. predetermined for the sememe in the sememe library. Clearly, in meaning, the "type attribute" can have a certain correspondence with the "part-of-speech classification", and this correspondence can be set manually in advance.
The "type attributes" and "part-of-speech classifications" may correspond one to one (of course, the name of a type attribute need not be literally the same as that of the classification); alternatively, multiple "type attributes" may correspond to the same "part-of-speech classification"; or each "type attribute" may correspond to multiple "part-of-speech classifications", etc., which will not be detailed here.
For example, the "type attributes" may also include "disease name"; or the "type attributes" may include "morbid state", with the "morbid state" type attribute preset to correspond to the "common symptom" classification.
In some embodiments, the conditional random field model further includes a word segmentation function for dividing the target text into multiple words to be matched; the sememe function matches each word to be matched against sememes in the preset sememe base and, when a word to be matched has a matching sememe, gives each character of that word a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe base.
That is, the target text can be segmented by the preset word segmentation function into multiple words with relatively independent meanings, i.e., multiple "words to be matched", which are then matched against sememes.
Evidently, the text may well contain consecutive characters that match a sememe but do not actually form a word. For example, in the text "今天真热啊" (it is really hot today), the two characters "天真" match a sememe, but in this text they obviously do not carry the meaning of the word "天真" (naive), i.e., they are not a "word".
Therefore, by first processing the target text with the word segmentation function to obtain words to be matched that are genuinely "words" in meaning, and then matching these words against sememes, the accuracy of the matching results can be ensured.
Of course, it should be understood that the words to be matched segmented at this point are not necessarily the finally determined words to be classified. For example, segmentation may yield "糖尿病" (diabetes) as a word to be matched, while in the target text these characters appear within "一型糖尿病" (type 1 diabetes), so it may ultimately be "一型糖尿病" that is determined as a "word to be classified".
Once the words to be matched are determined, the sememe base can be searched for a sememe textually identical to (i.e., matching) each word to be matched. If a word to be matched has a matching sememe, it is further checked whether that sememe's "type attribute" has a corresponding "part-of-speech classification"; if so, each character of the word to be matched is directly given a score for that classification.
For example, if segmenting the target text yields the word to be matched "糖尿病", the sememe base contains the sememe "糖尿病", and its type attribute corresponds to the "disease name" classification, then each of the three characters "糖", "尿", "病" in the target text may be given a "disease name" score of 0.5.
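The sememe-matching step can be sketched as follows, assuming a toy sememe base, a hand-set attribute-to-classification mapping, and the fixed per-character score of 0.5 from the example above (all of these are illustrative assumptions):

```python
# Toy sememe base: sememe text -> type attribute (Father information).
# The entries and the attribute-to-class mapping are illustrative only.
SEMEME_BASE = {"糖尿病": "疾病名", "抽搐": "病态"}
ATTR_TO_CLASS = {"疾病名": "disease_name", "病态": "common_symptom"}
MATCH_SCORE = 0.5  # fixed score per character on a match, as in the example

def sememe_scores(word):
    """Return {char_index: (pos_class, score)} for a matched word-to-be-matched,
    or {} when the word has no matching sememe (i.e. a score of 0)."""
    attr = SEMEME_BASE.get(word)            # exact textual match against sememes
    if attr is None or attr not in ATTR_TO_CLASS:
        return {}
    cls = ATTR_TO_CLASS[attr]               # type attribute -> POS classification
    return {i: (cls, MATCH_SCORE) for i in range(len(word))}

assert sememe_scores("糖尿病") == {0: ("disease_name", 0.5),
                                   1: ("disease_name", 0.5),
                                   2: ("disease_name", 0.5)}
assert sememe_scores("今天") == {}   # segmented word with no matching sememe
```

Because matching is exact, the function is all-or-nothing per word, matching the observation below that there is no "matching probability" involved.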
Since each word to be matched either matches a sememe or does not, there is no question of a "matching probability"; therefore, upon a match, the score the sememe function gives for the corresponding part-of-speech classification can be fixed (though it can also be computed via a "weight").
By providing the "sememe function", sememes can be used to uncover the semantic features of related words.
For example, in the training text the part-of-speech classification of "二型糖尿病" (type 2 diabetes) is "disease name", while "糖尿病" is a sememe whose type attribute also corresponds to the "disease name" classification. Hence, for "一型糖尿病" (type 1 diabetes) in the target text, the sememe function can give its "糖尿病" portion a high "disease name" score; combined with the template functions and other score functions, the conditional random field model is quite likely to also judge "一型糖尿病" in the target text to be a "disease name".
Of course, the conditional random field model may also include some other score functions.
For example, the conditional random field model may include score functions that judge whether a character is punctuation (Punctuation: IsPunc), whether it is a digit (Digits: IsDigits), whether it is a letter (Alphabets: IsAlpha), whether it is sentence-initial (Position of Char: Bos), whether it is sentence-final (Position of Char: Eos), whether it is a common suffix (Common Suffix), whether it is a common prefix (Common Prefix), and the like.
These score functions can also influence the final processing result. For example, if one of the above score functions judges a character to be "punctuation", it can give that character a very high score for the "punctuation" classification, thereby ensuring that the character will not be taken as part of a named entity. Or, if one of the above score functions judges a character to be "sentence-initial", then when determining the words to be classified, it can be fixed that this character and the character preceding it cannot belong to one word to be classified.
The above score functions can be designed based on the sklearn_crfsuite package for the Python language, which will not be described in detail here.
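With sklearn_crfsuite, such checks are typically expressed as a feature dict per character (the library consumes one list of dicts per sentence). A sketch of how the punctuation/digit/letter/BOS/EOS checks above might be encoded — the feature names and the punctuation set are illustrative, not taken from the disclosure:

```python
def char_features(text, i):
    """Feature dict for the character at position i, in the per-character
    dict style used with sklearn_crfsuite. Names are illustrative."""
    ch = text[i]
    return {
        "char": ch,
        "is_punc": ch in "，。！？；：、",          # punctuation check
        "is_digit": ch.isdigit(),                   # digit check
        "is_alpha": ch.isalpha() and ch.isascii(),  # Latin-letter check
        "bos": i == 0,                              # sentence-initial
        "eos": i == len(text) - 1,                  # sentence-final
    }

sent = "对糖尿病而言"
feats = [char_features(sent, i) for i in range(len(sent))]
assert feats[0]["bos"] and not feats[0]["eos"]
assert feats[-1]["eos"]
assert not feats[1]["is_digit"]
```

The `isascii()` guard matters for Chinese text: Python's `str.isalpha` is true for CJK characters, so without it every character would count as a "letter".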
S103: determining the named entities in the target text according to the part-of-speech classifications of the words to be classified.
According to the part-of-speech classifications of the words to be classified obtained above, it is determined which parts of the target text are named entities, together with the classifications those named entities correspond to, as the recognition result.
It should be understood that the above recognition result can exist in a variety of different forms.
For example, the recognition result can be saved as a CSV (comma-separated values) file, in which the target text forms one column, with each row of that column holding one character of the target text, and the part-of-speech classification forms another column, with each row holding the classification of the character of the target text in that row.
Specifically, if the symbol S denotes the "disease name" classification and the symbol Q denotes the "other" classification, the recognition result for the passage "对糖尿病而言" (as far as diabetes is concerned) in the target text can be saved in the following form:
对 Q
糖 S
尿 S
病 S
而 Q
言 Q
As can be seen, the three consecutive characters "糖", "尿", "病" all correspond to the S classification, so it can be recognized that "糖尿病" is a named entity whose part-of-speech classification is "disease name".
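Recovering entities from the per-character labels in the saved form above amounts to grouping consecutive characters that share the same non-"other" label; a minimal sketch:

```python
def extract_entities(chars, labels, other="Q"):
    """Group consecutive characters sharing the same non-`other` label
    into (entity, label) pairs."""
    entities, buf, cur = [], [], None
    for ch, lab in zip(chars, labels):
        if lab == cur and lab != other:
            buf.append(ch)               # extend the current entity
        else:
            if buf:
                entities.append(("".join(buf), cur))
            buf, cur = ([ch], lab) if lab != other else ([], None)
    if buf:                              # flush a trailing entity
        entities.append(("".join(buf), cur))
    return entities

chars  = list("对糖尿病而言")
labels = ["Q", "S", "S", "S", "Q", "Q"]
assert extract_entities(chars, labels) == [("糖尿病", "S")]
```

This simple grouping matches the label scheme described here, where a word to be classified is exactly a run of characters with the same classification.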
It can be seen that according to the embodiments of the present disclosure, the matching of words against "sememes" and the type attributes of those sememes serve as one reference factor when determining part-of-speech classifications, and a "sememe" is a known word of definite meaning whose type attribute is also predetermined. Therefore, first, the positional relationships of sememes can be exploited to raise the level of keyword abstraction and thereby reduce the number of feature templates (template functions); second, sememes can be used to uncover the semantic features of related words; third, a word corpus with clear orientation (the sememe base) helps judge the type (part-of-speech classification) of named entities.
Meanwhile, with well-designed feature templates (template functions), sufficiently rich semantic information can be extracted from a small amount of annotated data, which can greatly reduce the amount of manual annotation and save labor costs.
In short, the embodiments of the present disclosure can raise the abstraction level of feature templates and reduce their number, thereby on the one hand increasing the speed of computation and training, and on the other hand improving the accuracy of named entity recognition. Manual inspection of part of the target texts recognized in the manner of the embodiments of the present disclosure shows that the recognition accuracy can reach above 90%.
In some embodiments, determining the named entities in the target text according to the part-of-speech classifications of the words to be classified (S103) includes: extracting words to be classified whose part-of-speech classifications belong to a predetermined field as named entities.
As described above, when it is desired to recognize named entities of a predetermined field, only the words to be classified corresponding to the classifications related to that predetermined field may be extracted as the resulting named entities, and these named entities may be saved separately.
In some embodiments, the predetermined field is the medical field.
As described above, the predetermined field may specifically be the medical field.
In some embodiments, referring to FIG. 2, acquiring the target text (S101) includes: acquiring the target text from at least one of a medical database, a medical website, medical papers, medical textbooks, and medical records.
When named entities of the medical field are to be extracted, only content of the medical field may be selected as the target text, so as to enhance the pertinence of the target text. Content of the medical field may specifically include medical databases, medical websites, medical papers, medical textbooks, medical records, and the like.
For example, crawling may specifically be restricted to content of "medical encyclopedia knowledge" (e.g., determined via the category labels a website gives its encyclopedia entries) as the medical-field target text.
In some embodiments, referring to FIG. 2, after determining the named entities in the target text according to the part-of-speech classifications of the words to be classified (S103), the method further includes:
S104: outputting the named entities.
After the named entities are recognized in the above manner, the recognition result can also be output for further use.
The above "outputting" may include displaying the recognition result, broadcasting it by voice, printing it, transmitting it to other devices by copying/sending, etc., which will not be described in detail here.
In some embodiments, referring to FIG. 2, before determining the part-of-speech classifications of the words to be classified in the target text according to the preset conditional random field model (S102), the method further includes:
S1001: acquiring training text, each character of which has a preset part-of-speech classification.
S1002: training the conditional random field model with the training text.
That is, the above conditional random field (CRF) model may be obtained through training.
Specifically, some texts may first be selected, and each character in them manually annotated with its corresponding part-of-speech classification; these texts are then used as training texts to train the conditional random field model.
Obviously, the field of the training texts acquired here should match, as far as possible, the field of the named entities to be recognized.
For example, to recognize named entities of the medical field, medical texts may be selected as training texts from the above medical databases, medical websites, medical papers, medical textbooks, medical records, etc.
Specifically, the training process of the conditional random field model may be:
processing the training text with the current conditional random field model to obtain a processing result (the part-of-speech classification of each character);
comparing the recognition result with the manually annotated result (i.e., checking whether the classification given by the conditional random field model matches the manually annotated classification);
adjusting the parameters of the conditional random field model (mainly the weights of the feature functions) according to the difference between the recognition result and the manually annotated result, and returning to the above step of "processing the training text with the current conditional random field model";
repeating in this way gradually refines the conditional random field model, until an end condition is met, at which point training ends, yielding a conditional random field model ready for direct subsequent use.
The end condition can take various forms; for example, it may include the accuracy of the current conditional random field model's recognition results reaching a predetermined value (or converging), or the number of training iterations reaching a predetermined value, etc., which will not be described in detail here.
More specifically, the feature functions of each template function are generated from the training text.
For example, for any of the above template functions, when the current position corresponds to a certain character of the training text, the characters at the extraction position are thereby determined; multiple feature functions can then be generated, each with one possible classification as its "preset part-of-speech classification" and all with the same "preset text" (the characters at the extraction position). By traversing the training text, multiple feature functions are obtained at each character. Thus, if the training text has n characters in total and there are L possible part-of-speech classifications, the total number of feature functions generated during training is n*L.
The training process mainly adjusts the weights of the above feature functions: according to the different "preset part-of-speech classification - preset text" correspondences and their occurrence counts, the weights of different feature functions are increased or decreased, so as to ultimately ensure that, with the adjusted weights, the conditional random field model produces results essentially matching the manual annotations when processing the target text.
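The weight-adjustment loop described above can be sketched schematically. The disclosure does not specify the update rule, so this toy uses a simple count-based increment/decrement as a stand-in: weights of "preset text - gold classification" pairs grow with co-occurrence, while weights for other classifications shrink (and may go negative, as noted below):

```python
from collections import defaultdict

# (extracted_text, pos_class) -> lambda. Update rule and learning rate
# are illustrative assumptions, not the disclosed training algorithm.
weights = defaultdict(float)

def train(samples, classes, lr=0.1):
    """samples: (extracted_text, gold_class) pairs from traversing the
    annotated training text with one feature template."""
    for extracted, gold in samples:
        for cls in classes:
            if cls == gold:
                weights[(extracted, cls)] += lr   # seen together: strengthen
            else:
                weights[(extracted, cls)] -= lr   # not this class here: weaken

# 患有 ("suffering from") is usually followed by a disease name,
# almost never by an affected-population word.
samples = [("患有", "疾病名")] * 4 + [("患有", "发病人群")] * 1
train(samples, classes=["疾病名", "发病人群"])
assert weights[("患有", "疾病名")] > 0 > weights[("患有", "发病人群")]
```

After training, the "患有 - disease name" weight is positive and the "患有 - affected population" weight negative, mirroring the example in the following paragraph.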
For example, as above, because content of the form "患有(name of some disease)" ("suffering from ...") frequently appears in the training text, during training, for the C-2C-1 template function, once the feature function whose preset text is "患有" and whose preset classification is "disease name" is generated, its weight keeps increasing; conversely, since the two characters "患有" are almost never followed by an "affected population", once the feature function whose preset text is "患有" and whose preset classification is "affected population" is generated, its weight keeps decreasing and may even become negative.
It should be understood that the weight of, or the score given by, the sememe function can also be continuously adjusted in the above training process.
Thus, by generating the feature functions of each feature template (template function) through training, and adjusting the weight of each feature function, the combined effect of these template functions and the sememe function enables reasonably accurate named entity recognition.
The conditional random field (CRF) model is a machine learning algorithm; compared with other deep learning algorithms, the conditional random field model requires a small amount of training, so the training process can be completed by manually annotating only a small amount of text, making it easy to implement and highly efficient.
Of course, it is also feasible for the conditional random field (CRF) model to be a "copy" (itself untrained) replicated from a trained conditional random field model stored in another storage medium.
In some embodiments, after the named entities are obtained, the relationships between different named entities can be further determined according to these named entities and their corresponding part-of-speech classifications, so as to build a knowledge graph of the named entities.
For example, the named entities obtained above can serve as the "entities" of the knowledge graph, and the part-of-speech classifications of the named entities as the "attributes" of the entities in the knowledge graph; by extraction, it is determined which entities the knowledge graph contains and what relationships exist among them (e.g., sharing the same attribute), yielding the knowledge graph.
In some embodiments, each knowledge graph can be a knowledge graph of a predetermined field (e.g., the medical field), i.e., its entities (named entities) and attributes (part-of-speech classifications) are all related to that predetermined field.
In a second aspect, referring to FIG. 3, an embodiment of the present disclosure provides a method of building a named entity dictionary, including:
S201: determining multiple target texts.
It is determined from which texts (target texts) named entities are to be acquired, so as to build the named entity dictionary.
The above "determining" may be manually designating some target texts to be processed, or obtaining some target texts to be processed via crawling or the like.
S202: determining multiple named entities in the multiple target texts according to any of the above named entity recognition methods.
According to the named entity recognition method of the embodiments of the present disclosure, multiple named entities, together with their part-of-speech classifications, are determined from the multiple target texts.
S203: building a named entity dictionary according to the multiple named entities.
According to the named entities determined above and their part-of-speech classifications, some of the named entities are collected together to form a relatively independent database, i.e., a "named entity dictionary".
In some embodiments, the words in each named entity dictionary may all have the same part-of-speech classification.
For example, all named entities classified as "disease name" may be added to one named entity dictionary, yielding a "disease name dictionary"; all named entities classified as "affected population" may be added to one named entity dictionary, yielding an "affected population dictionary"; and so on.
It should be understood that when the words in a named entity dictionary all have the same part-of-speech classification, the dictionary may store only the named entities (i.e., the words) without their classifications (i.e., the words' attributes); instead, the classification of the named entities in it may be expressed uniformly, e.g., by the name of the named entity dictionary.
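Grouping the recognized entities into one per-class dictionary each (step S203 with same-class dictionaries) can be sketched as:

```python
from collections import defaultdict

def build_dictionaries(entities):
    """entities: iterable of (named_entity, pos_class) pairs.
    Returns one dictionary (a set of words) per part-of-speech class;
    the class is carried by the dictionary's key, not stored per word."""
    dicts = defaultdict(set)
    for word, pos_class in entities:
        dicts[pos_class].add(word)           # deduplicates repeated entities
    return dict(dicts)

entities = [("糖尿病", "疾病名"), ("水痘", "疾病名"), ("儿童", "发病人群")]
dicts = build_dictionaries(entities)
assert dicts["疾病名"] == {"糖尿病", "水痘"}    # the "disease name dictionary"
assert dicts["发病人群"] == {"儿童"}            # the "affected population dictionary"
```

Using a set per class reflects the note above that a same-class dictionary need not store the classification alongside each word.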
In some embodiments, the part-of-speech classifications of the words in each named entity dictionary may all belong to one predetermined field.
For example, named entities of medical-field classifications such as "disease name", "affected population", "affected body part", "cause of disease", "common symptom", "consulting department", "transmission route", "physiological indicator", "examination means", "treatment means" may be added to one named entity dictionary, yielding a "medical dictionary".
In some embodiments, the part-of-speech classifications of the words in each named entity dictionary may also belong to multiple different fields respectively, i.e., the named entity dictionary may be a "comprehensive dictionary".
It should be understood that when the words in a named entity dictionary have different part-of-speech classifications, the dictionary may store both the named entities (i.e., the words) and their classifications (i.e., the words' attributes).
Because the amount of data is too large to check exhaustively, sampling inspection was performed on a named entity dictionary built according to the method of the embodiments of the present disclosure, and its accuracy reached above 90%; for example, for some English drug names or examination methods that even humans need to consult references to verify, the method of the embodiments of the present disclosure achieves essentially accurate recognition.
In a third aspect, referring to FIG. 4, an embodiment of the present disclosure provides a named entity recognition device, including:
an acquisition module configured to acquire a target text;
a classification module configured to determine, according to a preset conditional random field model, the words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model includes multiple score functions, the score functions including a sememe function and at least one template function; each template function is used to give a score of each character in the target text for each part-of-speech classification; the sememe function is used to match at least some words in the target text against sememes in a preset sememe base and, when a word has a matching sememe, give each character of that word a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe base; the conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text for each part-of-speech classification, wherein the total score of any character for any part-of-speech classification is the sum of the scores for that classification given to that character by all the score functions; the conditional random field model is further used to determine multiple characters having the same part-of-speech classification as one word to be classified, and to determine the part-of-speech classification of each word to be classified as the part-of-speech classification of any character therein; and
a determination module configured to determine the named entities in the target text according to the part-of-speech classifications of the words to be classified.
The named entity recognition device of the embodiments of the present disclosure can implement the above named entity recognition method.
In some embodiments, the determination module is configured to extract words to be classified whose part-of-speech classifications belong to a predetermined field as named entities.
Specifically, the determination module of the named entity recognition device of the embodiments of the present disclosure may also extract named entities of a predetermined field (e.g., the medical field).
In a fourth aspect, referring to FIG. 5, an embodiment of the present disclosure provides an electronic device, including:
one or more processors; and
a memory on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement any of the above named entity recognition methods.
The electronic device of the embodiments of the present disclosure can implement the above named entity recognition method.
In some embodiments, referring to FIG. 6, the electronic device of the embodiments of the present disclosure further includes one or more I/O interfaces, connected between the processors and the memory and configured to enable information exchange between the processors and the memory.
In the electronic device of the embodiments of the present disclosure, an I/O interface may also be provided to enable data exchange between the processor and the memory therein.
The processor is a device with data processing capability, including but not limited to a central processing unit (CPU); the memory is a device with data storage capability, including but not limited to random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), and flash memory (FLASH); the I/O interface (read/write interface) is connected between the processor and the memory and enables information exchange between the memory and the processor, including but not limited to a data bus (Bus).
In a fifth aspect, referring to FIG. 7, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, the program implementing any of the above named entity recognition methods when executed by a processor.
When the program in the computer-readable medium of the embodiments of the present disclosure is executed, the above named entity recognition method can be implemented.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and functional modules/units of the devices disclosed above may be implemented as software, firmware, hardware, and appropriate combinations thereof.
In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation.
Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit (CPU), digital signal processor, or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, random access memory (RAM, more specifically SDRAM, DDR, etc.), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory (FLASH), or other disk storage; compact disc read-only memory (CD-ROM), digital versatile disc (DVD), or other optical storage; magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage; or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
The present disclosure has disclosed example embodiments, and although specific terms are employed, they are used and should be interpreted in a generic and descriptive sense only and not for purposes of limitation. In some instances, it will be apparent to those skilled in the art that, unless expressly stated otherwise, features, characteristics, and/or elements described in connection with a particular embodiment may be used alone, or in combination with features, characteristics, and/or elements described in connection with other embodiments. Accordingly, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of the present disclosure as set forth in the appended claims.

Claims (16)

  1. A named entity recognition method, comprising:
    acquiring a target text;
    determining, according to a preset conditional random field model, words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model comprises multiple score functions, the score functions comprising a sememe function and at least one template function; each template function is used to give a score of each character in the target text for each part-of-speech classification; the sememe function is used to match at least some words in the target text against sememes in a preset sememe base and, when the at least some words have a matching sememe, give each character of the at least some words a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe base; the conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text for each part-of-speech classification, wherein the total score of any character for any part-of-speech classification is the sum of the scores for that part-of-speech classification given to that character by all the score functions; the conditional random field model is further used to determine multiple characters having the same part-of-speech classification as one word to be classified, and to determine the part-of-speech classification of each word to be classified as the part-of-speech classification of any character therein; and
    determining named entities in the target text according to the part-of-speech classifications of the words to be classified.
  2. The method according to claim 1, wherein
    each template function has a feature template and multiple feature functions, each feature template specifies an extraction position having a definite positional relationship with a current position, and the feature templates of any two template functions are different; and
    each template function is used to place each character of the target text at the current position in turn; when any character is at the current position, each feature function is used to judge whether the character at the extraction position matches the preset text specified by the feature function itself, and to give the character at the current position a preset score for one preset part-of-speech classification.
  3. The method according to claim 2, wherein the extraction position specified by each feature template is any one of the following:
    C-2, C-1, C0, C1, C2, C-2C-1, C-1C0, C0C1, C1C2, C-3C-2C-1, C-2C-1C0, C-1C0C1, C0C1C2, C1C2C3;
    where Cn denotes the position n characters after the current position, C-n denotes the position n characters before the current position, and n is any one of 0, 1, 2, 3.
  4. The method according to claim 3, wherein the conditional random field model further comprises score functions that judge whether a character is punctuation, whether it is a digit, whether it is a letter, whether it is sentence-initial, whether it is sentence-final, whether it is a common suffix, and whether it is a common prefix.
  5. The method according to any one of claims 1-4, wherein determining the part-of-speech classification of each character according to the total score of each character in the target text for each part-of-speech classification comprises:
    determining the maximum total score among all the total scores corresponding to each character in the target text; and
    taking the part-of-speech classification corresponding to the maximum total score as the part-of-speech classification of that character.
  6. The method according to any one of claims 1-5, wherein
    the conditional random field model further comprises a word segmentation function for dividing the target text into multiple words to be matched; and
    the sememe function is used to match each word to be matched against sememes in the preset sememe base and, when a word to be matched has a matching sememe, give each character of that word a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe base.
  7. The method according to any one of claims 1-6, wherein determining the named entities in the target text according to the part-of-speech classifications of the words to be classified comprises:
    extracting words to be classified whose part-of-speech classifications belong to a predetermined field as named entities.
  8. The method according to claim 7, wherein
    the predetermined field is the medical field.
  9. The method according to any one of claims 1-8, wherein acquiring the target text comprises:
    acquiring the target text from at least one of a medical database, a medical website, medical papers, medical textbooks, and medical records.
  10. The method according to any one of claims 1-9, further comprising, after determining the named entities in the target text according to the part-of-speech classifications of the words to be classified:
    outputting the named entities.
  11. The method according to any one of claims 1-10, further comprising, before determining the words to be classified in the target text and their part-of-speech classifications according to the preset conditional random field model:
    acquiring training text, each character of which has a preset part-of-speech classification; and
    training the conditional random field model with the training text.
  12. A method of building a named entity dictionary, comprising:
    determining multiple target texts;
    determining multiple named entities in the multiple target texts according to the named entity recognition method of any one of claims 1 to 11; and
    building a named entity dictionary according to the multiple named entities.
  13. A named entity recognition device, comprising:
    an acquisition module configured to acquire a target text;
    a classification module configured to determine, according to a preset conditional random field model, words to be classified in the target text and their part-of-speech classifications; wherein the conditional random field model comprises multiple score functions, the score functions comprising a sememe function and at least one template function; each template function is used to give a score of each character in the target text for each part-of-speech classification; the sememe function is used to match at least some words in the target text against sememes in a preset sememe base and, when the at least some words have a matching sememe, give each character of the at least some words a score for the part-of-speech classification corresponding to that sememe's type attribute in the sememe base; the conditional random field model is used to determine the part-of-speech classification of each character according to the total score of each character in the target text for each part-of-speech classification, wherein the total score of any character for any part-of-speech classification is the sum of the scores for that part-of-speech classification given to that character by all the score functions; the conditional random field model is further used to determine multiple characters having the same part-of-speech classification as one word to be classified, and to determine the part-of-speech classification of each word to be classified as the part-of-speech classification of any character therein; and
    a determination module configured to determine the named entities in the target text according to the part-of-speech classifications of the words to be classified.
  14. The device according to claim 13, wherein
    the determination module is configured to extract words to be classified whose part-of-speech classifications belong to a predetermined field as named entities.
  15. An electronic device, comprising:
    one or more processors; and
    a memory on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the named entity recognition method of any one of claims 1 to 11.
  16. A computer-readable medium on which a computer program is stored, the program implementing the named entity recognition method of any one of claims 1 to 11 when executed by a processor.
PCT/CN2020/073155 2020-01-20 2020-01-20 Entity recognition method and device, dictionary creating method, device and medium WO2021146831A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
US17/299,023 US20220318509A1 (en) 2020-01-20 2020-01-20 Entity recognition method and device, dictionary creating method, device and medium
EP20891421.8A EP4095738A4 (en) 2020-01-20 2020-01-20 METHOD AND DEVICE FOR RECOGNIZING ENTITIES, METHOD FOR GENERATION OF A DICTIONARY, DEVICE AND MEDIUM
CN202080000047.1A CN113632092A (zh) 2020-01-20 2020-01-20 Entity recognition method and device, dictionary creating method, device and medium
PCT/CN2020/073155 WO2021146831A1 (zh) 2020-01-20 2020-01-20 Entity recognition method and device, dictionary creating method, device and medium

Publications (1)

Publication Number Publication Date
WO2021146831A1 true WO2021146831A1 (zh) 2021-07-29

Family

ID=76992002

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073155 WO2021146831A1 (zh) 2020-01-20 2020-01-20 实体识别的方法和装置、建立词典的方法、设备、介质

Country Status (4)

Country Link
US (1) US20220318509A1 (zh)
EP (1) EP4095738A4 (zh)
CN (1) CN113632092A (zh)
WO (1) WO2021146831A1 (zh)





Also Published As

Publication number Publication date
EP4095738A1 (en) 2022-11-30
US20220318509A1 (en) 2022-10-06
CN113632092A (zh) 2021-11-09
EP4095738A4 (en) 2023-01-04


Legal Events

Date Code Title Description
NENP: Non-entry into the national phase — Ref country code: DE
ENP: Entry into the national phase — Ref document number: 2020891421; Country of ref document: EP; Effective date: 20220822