CN107943786A - Chinese named entity recognition method and system

Info

Publication number: CN107943786A
Application number: CN201711137581.3A
Authority: CN
Inventors: 吴远辉
Original assignee: Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Current assignee: Guangzhou Wanlong Securities Advisory Consultants Co Ltd
Priority date: 2017-11-16
Filing date: 2017-11-16
Publication date: 2018-04-20
Anticipated expiration: 2037-11-16
Also published as: CN107943786B

Abstract

The invention discloses a Chinese named entity recognition method and system. The method comprises the following steps: S1, performing rule-based entity recognition on a target text to obtain a first named entity set; S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set; S3, cleaning the first named entity set and the second named entity set to obtain a recognition result. The invention performs entity recognition on the target text separately by rule matching and by a statistical algorithm, cleans both recognition results, and combines them into the final Chinese entity recognition result. This greatly improves the recall of Chinese entity recognition while maintaining its accuracy, and because recognition is automatic, it is fast. The invention can be widely applied in the field of text information processing.

Description

A Chinese named entity recognition method and system

Technical field

The present invention relates to the fields of computer applications and information processing, and more particularly to a Chinese named entity recognition method and system.

Background art

Named entities are the basic information elements of a target text and the foundation for understanding it correctly. Chinese named entity recognition is an important basic tool for application fields such as information extraction, syntactic analysis, and machine learning, and it plays a key role as natural language processing technology moves toward practical use. The task of Chinese named entity recognition is to judge whether a character string represents a named entity. In information extraction research, it is currently the technology with the greatest practical value. Common approaches rely purely on statistical models such as hidden Markov models or maximum entropy models.

At present, the naming of Chinese companies follows no strong word-formation rules and usage is fairly arbitrary, so company names often appear in abbreviated form. For example, "Bank of China Co., Ltd." often appears as "Bank of China" or as the shorter abbreviation "Zhonghang" (中行). This makes company names difficult to recognize and use. In general, recognizing Chinese named entities such as Chinese company abbreviations involves the following difficulties: 1. abbreviation variants of a name multiply across different fields and scenarios; 2. certain types of entity names change frequently and follow no strict rules; 3. the forms of expression are diverse; 4. the number of names is enormous, so they cannot be enumerated and are difficult to include in full in a dictionary. Overall, when processing Chinese target text, the quality of Chinese word segmentation strongly affects the recognition of Chinese named entities, which in turn affects the analysis and processing of the target text, resulting in low recall and slow recognition.

Summary of the invention

In order to solve the above technical problems, the object of the present invention is to provide a Chinese named entity recognition method and system.

The technical solution adopted by the present invention to solve its technical problems is:

A Chinese named entity recognition method, comprising the following steps:

S1, performing rule-based entity recognition on a target text to obtain a first named entity set;

S2, performing entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

S3, cleaning the first named entity set and the second named entity set to obtain a recognition result.

Further, step S1 specifically comprises:

S11, splitting the content of the target text into sentences;

S12, performing content extraction based on punctuation mark rules on the split target text;

S13, performing content extraction based on syntactic template rules on the split target text;

S14, performing content extraction based on table features on the split target text;

S15, generating the first named entity set from all the extracted named entities.
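The flow of steps S11 and S15 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `split_sentences` and `first_entity_set`, the set of sentence-ending marks, and the stub extractor `known_name_extractor` are all hypothetical and do not come from the patent text. The stub simply stands in for the rule-based extractors of steps S12 to S14.

```python
import re

# Assumed sentence terminators; the patent does not list them.
SENTENCE_PATTERN = re.compile(r"[^。！？!?]+[。！？!?]?")


def split_sentences(text):
    """Step S11 (sketch): split the target text into sentences."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


def first_entity_set(text, extractors):
    """Step S15 (sketch): collect every candidate from every extractor."""
    entities = set()
    for sentence in split_sentences(text):
        for extract in extractors:
            entities.update(extract(sentence))
    return entities


def known_name_extractor(sentence):
    """Hypothetical stub standing in for the extractors of S12 to S14."""
    return [name for name in ("中国银行", "中行") if name in sentence]


text = "中国银行今日发布公告。市场认为中行的业绩稳健！天气晴朗"
sentences = split_sentences(text)
candidates = first_entity_set(text, [known_name_extractor])
```

Passing the extractors in as a list keeps each rule family independent, so a rule can be added or removed without touching the aggregation step.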

Further, step S2 specifically comprises:

S21, performing word segmentation on the target text;

S22, performing part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;

S23, based on the hidden Markov model statistical learning method, performing statistical analysis on the part-of-speech tagging result, and generating the second named entity set from the named entities obtained by the analysis.
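Steps S21 and S22 might look roughly like the sketch below. The patent does not say which segmentation algorithm is used, so forward maximum matching is swapped in here purely as a simple, well-known stand-in. The function names, the tiny `pos_database`, and the tag labels ("nt", "v", "n", "x") are assumptions made for illustration.

```python
def segment(text, lexicon, max_word_len=4):
    """Step S21 (sketch): forward maximum matching against a lexicon."""
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and fall back to a single character.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def tag(words, pos_database, default="x"):
    """Step S22 (sketch): look each word up in a preset part-of-speech table."""
    return [(word, pos_database.get(word, default)) for word in words]


# Hypothetical part-of-speech database; it also serves as the lexicon.
pos_database = {"中国银行": "nt", "宣布": "v", "下调": "v", "利率": "n"}

words = segment("中国银行宣布下调利率", pos_database)
tagged = tag(words, pos_database)
```

A real system would use a much larger lexicon and a segmenter that resolves ambiguity, which matters here because segmentation quality directly affects entity recognition.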

Further, step S3 specifically comprises:

S31, performing data cleaning on the first named entity set and the second named entity set respectively according to a preset noise lexicon, removing noise words;

S32, taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.
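Steps S31 and S32 map naturally onto set operations. The sketch below assumes that cleaning means dropping any candidate found in the noise lexicon; the function names and the sample data are hypothetical.

```python
def clean(entities, noise_lexicon):
    """Step S31 (sketch): drop empty candidates and noise words."""
    stripped = (entity.strip() for entity in entities)
    return {e for e in stripped if e and e not in noise_lexicon}


def merge(first_set, second_set, noise_lexicon):
    """Step S32 (sketch): union of the two cleaned sets."""
    return clean(first_set, noise_lexicon) | clean(second_set, noise_lexicon)


# Made-up candidate sets and noise lexicon for illustration.
rule_based = {"中国银行股份有限公司", "中国银行", "公司"}
statistical = {"中国银行", "中行", "记者"}
noise_lexicon = {"公司", "记者"}

result = merge(rule_based, statistical, noise_lexicon)
```

Taking the union rather than the intersection is what raises recall: an entity found by either recognizer survives, and the noise lexicon is the guard against the extra false positives this lets in.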

Another technical solution adopted by the present invention to solve its technical problems is:

A Chinese named entity recognition system, comprising the following modules:

a first identification module, configured to perform rule-based entity recognition on a target text to obtain a first named entity set;

a second identification module, configured to perform entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

a cleaning module, configured to clean the first named entity set and the second named entity set to obtain a recognition result.

Further, the first identification module specifically comprises:

a separating unit, configured to split the content of the target text into sentences;

a first extraction unit, configured to perform content extraction based on punctuation mark rules on the split target text;

a second extraction unit, configured to perform content extraction based on syntactic template rules on the split target text;

a third extraction unit, configured to perform content extraction based on table features on the split target text;

a generating unit, configured to generate the first named entity set from all the extracted named entities.

Further, the second identification module specifically comprises:

a word segmentation unit, configured to perform word segmentation on the target text;

a part-of-speech tagging unit, configured to perform part-of-speech tagging on the word segmentation result based on a preset part-of-speech database;

a statistical analysis unit, configured to perform statistical analysis on the part-of-speech tagging result based on the hidden Markov model statistical learning method, and to generate the second named entity set from the named entities obtained by the analysis.

Further, the cleaning module specifically comprises:

a data cleaning unit, configured to perform data cleaning on the first named entity set and the second named entity set respectively according to a preset noise lexicon, removing noise words;

a computing unit, configured to take the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.

The beneficial effects of the method and system of the present invention are as follows: the invention performs entity recognition on the target text separately by rule matching and by a statistical algorithm, cleans both recognition results, and combines them into the final Chinese entity recognition result. This greatly improves the recall of Chinese entity recognition while maintaining its accuracy, and because Chinese entities are recognized automatically, recognition is fast.

Brief description of the drawings

Fig. 1 is a flow chart of the Chinese named entity recognition method of the present invention;

Fig. 2 is a structural block diagram of the Chinese named entity recognition system of the present invention.

Detailed description of the embodiments

With reference to Fig. 1, the present invention provides a Chinese named entity recognition method comprising steps S1 to S3 as set out above.

Here, the target text is the text on which Chinese named entity recognition is to be performed.

The method performs entity recognition on the target text separately by rule matching and by a statistical algorithm, cleans both recognition results, and combines them into the final Chinese entity recognition result. This greatly improves the recall of Chinese entity recognition while maintaining its accuracy, and because Chinese entities are recognized automatically, recognition is comparatively fast.

As a further preferred embodiment, step S1 specifically comprises:

S11, splitting the content of the target text into sentences;

S12, performing content extraction based on punctuation mark rules on the split target text. For example, in some documents it is customary to enclose entity names in double quotation marks or title marks (《》); in that case, the name inside the double quotation marks or title marks is extracted. Corresponding punctuation mark rules can therefore be created according to people's usage habits. These rules record the punctuation marks associated with Chinese entity names and the corresponding extraction rules, and the content extracted according to them serves as candidate Chinese entity names.

S13, performing content extraction based on syntactic template rules on the split target text. For example, the subject preceding verbs such as "announce", "state", or "say" is usually an entity name. Corresponding syntactic template rules are therefore created according to language habits. These rules record the words associated with Chinese entity names and the corresponding extraction rules, so that content can be extracted from the target text according to the syntactic template rules.
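The two rule families of steps S12 and S13 could be expressed with regular expressions, as in the sketch below. Everything specific in it is an assumption made for illustration: the rule table `PUNCTUATION_RULES`, the length limit, the Chinese verb forms in `REPORTING_VERBS` (a guess at the wording behind "announce", "state", and "say"), and the simplification that the subject is the text between the start of a clause and the verb.

```python
import re

# Hypothetical rule table: (opening mark, closing mark) pairs.
PUNCTUATION_RULES = [("“", "”"), ("《", "》")]

# Assumed reporting verbs; the Chinese forms are a guess at the source wording.
REPORTING_VERBS = ["宣布", "称", "说"]


def extract_by_punctuation(sentence, rules=PUNCTUATION_RULES, max_len=20):
    """Step S12 (sketch): extract the content enclosed by each mark pair."""
    candidates = []
    for open_mark, close_mark in rules:
        pattern = re.escape(open_mark) + r"(.+?)" + re.escape(close_mark)
        for match in re.finditer(pattern, sentence):
            content = match.group(1).strip()
            if 0 < len(content) <= max_len:
                candidates.append(content)
    return candidates


def extract_by_template(sentence, verbs=REPORTING_VERBS):
    """Step S13 (sketch): take the clause-initial text before a reporting verb."""
    verb_group = "|".join(re.escape(verb) for verb in verbs)
    # A clause starts at the sentence start or right after a comma.
    pattern = r"(?:^|[，,])([^，,。]{2,20}?)(?:" + verb_group + ")"
    return [match.group(1) for match in re.finditer(pattern, sentence)]


quoted = extract_by_punctuation("“中国银行”在《年度报告》中披露了业绩。")
subjects = extract_by_template("昨日，中国银行宣布下调利率。")
```

Both functions return candidates only. A quoted phrase such as a report title is not necessarily an entity name, which is one reason the candidates are cleaned against a noise lexicon afterwards.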

As a further preferred embodiment, step S2 specifically comprises:

S21, performing word segmentation on the target text;

S23, based on the hidden Markov model statistical learning method, performing statistical analysis on the part-of-speech tagging result, and generating the second named entity set from the named entities obtained by the analysis. This step works as follows: first, starting from known, correct entity names, the probability of each keyword appearing before them is counted; then entity names are inferred from the high-probability keywords. In this way, the recall of recognition is greatly improved without affecting the accuracy of the recognized Chinese entity names, the Chinese entity names in the text are identified more comprehensively, and because they are obtained automatically, recognition is fast.
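The keyword statistics described in step S23 can be illustrated with a simple frequency count. Note that this sketch is a deliberate simplification and not a full hidden Markov model: it only estimates, for each word, how often the next word is a known entity, and then flags the word that follows any high-probability keyword. The function names, the threshold of 0.6, and the training sentences are all made up for illustration.

```python
from collections import Counter


def learn_keyword_probabilities(tagged_sentences):
    """Estimate P(next word is an entity | current word) from labelled data.

    Each sentence is a list of (word, is_entity) pairs.
    """
    before_entity = Counter()
    total = Counter()
    for sentence in tagged_sentences:
        for (word, _), (_, next_is_entity) in zip(sentence, sentence[1:]):
            total[word] += 1
            if next_is_entity:
                before_entity[word] += 1
    return {word: before_entity[word] / total[word] for word in total}


def infer_entities(words, probabilities, threshold=0.6):
    """Flag the word that follows any keyword above the threshold."""
    return [following for previous, following in zip(words, words[1:])
            if probabilities.get(previous, 0.0) >= threshold]


# Made-up training data: True marks a known, correct entity name.
training = [
    [("公司", False), ("收购", False), ("中国银行", True), ("股份", False)],
    [("计划", False), ("收购", False), ("中行", True)],
    [("收购", False), ("价格", False), ("上涨", False)],
]

probabilities = learn_keyword_probabilities(training)
found = infer_entities(["该", "集团", "收购", "工商银行"], probabilities)
```

Because the keyword, not the entity itself, triggers the match, this kind of statistic can surface names that appear in no dictionary, which is the source of the recall gain the step aims for.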

As a further preferred embodiment, step S3 specifically comprises steps S31 and S32 as set out above.

With reference to Fig. 2, the present invention provides a Chinese named entity recognition system, comprising the following modules:

a first identification module 100, configured to perform rule-based entity recognition on a target text to obtain a first named entity set;

a second identification module 200, configured to perform entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

a cleaning module 300, configured to clean the first named entity set and the second named entity set to obtain a recognition result.
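One way to picture how modules 100, 200 and 300 fit together is the sketch below. The class name `RecognitionSystem`, its field names, and the stub modules are hypothetical; the patent describes the modules functionally and does not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class RecognitionSystem:
    """Hypothetical wiring of modules 100, 200 and 300."""
    first_identification: Callable[[str], Set[str]]      # module 100
    second_identification: Callable[[str], Set[str]]     # module 200
    cleaning: Callable[[Set[str], Set[str]], Set[str]]   # module 300

    def recognize(self, target_text: str) -> Set[str]:
        first_set = self.first_identification(target_text)
        second_set = self.second_identification(target_text)
        return self.cleaning(first_set, second_set)


NOISE = {"记者"}  # made-up noise lexicon

# Stub modules for illustration only.
system = RecognitionSystem(
    first_identification=lambda text: {"中国银行"} if "中国银行" in text else set(),
    second_identification=lambda text: {"中行", "记者"} if "中行" in text else set(),
    cleaning=lambda first, second: (first - NOISE) | (second - NOISE),
)

result = system.recognize("记者获悉，中国银行（简称中行）发布公告。")
```

Keeping the two identification modules independent means either one can be replaced, for example with a different statistical model, without changing the cleaning module.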

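As a further preferred embodiment, the first identification module 100 specifically comprises the separating unit, the first, second, and third extraction units, and the generating unit as set out above.

As a further preferred embodiment, the second identification module 200 specifically comprises the word segmentation unit, the part-of-speech tagging unit, and the statistical analysis unit as set out above.

As a further preferred embodiment, the cleaning module 300 specifically comprises the data cleaning unit and the computing unit as set out above.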

The Chinese named entity recognition system of the present invention can execute the Chinese named entity recognition method provided by the present invention, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.

The above describes preferred implementations of the present invention, but the invention is not limited to these embodiments. Those skilled in the art can make various equivalent variations or substitutions without departing from the spirit of the invention, and these equivalent variations or substitutions are all included within the scope defined by the claims of this application.

Claims

1. A Chinese named entity recognition method, characterized by comprising the following steps:

2. The Chinese named entity recognition method according to claim 1, characterized in that step S1 specifically comprises:

S11, splitting the content of the target text into sentences;

3. The Chinese named entity recognition method according to claim 1, characterized in that step S2 specifically comprises:

S21, performing word segmentation on the target text;

S23, based on the hidden Markov model statistical learning method, performing statistical analysis on the part-of-speech tagging result, and generating the second named entity set from the named entities obtained by the analysis.

4. The Chinese named entity recognition method according to claim 1, characterized in that step S3 specifically comprises:

S31, performing data cleaning on the first named entity set and the second named entity set respectively according to a preset noise lexicon, removing noise words;

S32, taking the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.

5. A Chinese named entity recognition system, characterized by comprising the following modules:

a first identification module, configured to perform rule-based entity recognition on a target text to obtain a first named entity set;

a second identification module, configured to perform entity recognition on the target text using a statistical algorithm to obtain a second named entity set;

a cleaning module, configured to clean the first named entity set and the second named entity set to obtain a recognition result.

6. The Chinese named entity recognition system according to claim 5, characterized in that the first identification module specifically comprises:

7. The Chinese named entity recognition system according to claim 5, characterized in that the second identification module specifically comprises:

a statistical analysis unit, configured to perform statistical analysis on the part-of-speech tagging result based on the hidden Markov model statistical learning method, and to generate the second named entity set from the named entities obtained by the analysis.

8. The Chinese named entity recognition system according to claim 5, characterized in that the cleaning module specifically comprises:

a data cleaning unit, configured to perform data cleaning on the first named entity set and the second named entity set respectively according to a preset noise lexicon, removing noise words;

a computing unit, configured to take the union of the cleaned first named entity set and the cleaned second named entity set as the named entity recognition result.