CN109408819A - Core place name extraction method and device based on natural language processing technology

Info

Publication number: CN109408819A
Application number: CN201811202492.7A
Authority: CN
Inventors: 段春先; 尹展鹏; 胡锐; 程方
Original assignee: WUDA GEOINFORMATICS CO Ltd
Current assignee: WUDA GEOINFORMATICS CO Ltd
Priority date: 2018-10-16
Filing date: 2018-10-16
Publication date: 2019-03-01
Anticipated expiration: 2038-10-16
Also published as: CN109408819B

Abstract

The present invention relates to the technical field of geographic information and provides a core place name extraction method and device based on natural language processing technology. The device comprises: a Chinese word segmentation dictionary production unit; a place name set acquisition unit; a frequency score calculation unit; an importance score calculation unit; a relationship score calculation unit; a component score calculation unit; a total score sorting unit; and a core place name judgment unit. The steps of the core place name extraction algorithm are simple, easy to understand and implement, and can be quickly applied in production projects.

Description

Core place name extraction method and device based on natural language processing technology

Technical field

The invention belongs to the technical field of geographic information, and in particular relates to a core place name extraction method and device based on natural language processing technology.

Background art

Place name recognition is a type of named entity recognition in natural language processing and can extract multiple place names from a text. According to their degree of correlation with the text content, place names can be divided into core place names, strongly related place names and weakly related place names: a core place name is the place name most relevant to the text topic, a strongly related place name has some association with the text topic, and a weakly related place name has little association with the text topic. In internet public opinion monitoring, when a computer performs regional analysis on public opinion information, weakly related place names interfere with the analysis, so the precision of the regional analysis is low and it is difficult to extract the core place names closely related to the topic of the text data. Existing natural language processing algorithms can generally only extract the place names in a text; they cannot express the membership relationships between place names, cannot rank the place names by their degree of correlation with the text topic, and therefore cannot obtain the place names closely related to the text topic.

The Chinese patent with application number CN201410381574.8, entitled "Intelligent place name recognition technology based on statistical model", uses methods such as recognition of superior and subordinate place names, context recognition with a place name statistical model, and disambiguation of place names that occur inside names. It achieves practical-level accuracy in place name recognition, but it does not determine the degree of correlation between a place name and the text topic.

In view of the above deficiencies, the present invention provides a core place name extraction method and device based on natural language processing technology, with which the place names highly correlated with the topic can be extracted from the multiple place names in a text.

Summary of the invention

In view of the above problems, the purpose of the present invention is to provide a core place name extraction method and device based on natural language processing technology, intended to solve the technical problems of the prior art that weakly related place names interfere with the analysis, that the precision of regional analysis is therefore low, and that it is difficult to extract the core place names closely related to the topic of the text data.

The present invention adopts the following technical scheme:

The core place name extraction method based on natural language processing technology comprises the following steps:

Step S1: compile the names of the provinces, cities and counties nationwide and their administrative division codes into a one-to-one data table, save the hierarchical membership relationships between the place names, and make the names of the provinces, cities and counties into a Chinese word segmentation dictionary;

Step S2: according to the Chinese word segmentation dictionary, use a natural language processing tool to perform Chinese word segmentation on a specified piece of text data, and obtain a place name set that includes sentence components;

Step S3: count the number of occurrences of each identical place name in the place name set, and calculate the frequency score of each place name;

Step S4: judge whether a place name appears in the title of the text data, and calculate the importance score of that place name;

Step S5: according to the province, city and county membership relationships saved in step S1, score the hierarchical membership relationship of each place name in the place name set, and calculate the membership score;

Step S6: according to the component that each place name in the place name set serves as in its sentence, obtain the place name component score;

Step S7: add the four scores of steps S3 to S6 to obtain the topic relevance score of each place name, aggregate the place names by city, calculate for each city the sum of the topic relevance scores of the city and all districts and counties under its jurisdiction, and sort the sums from high to low;

Step S8: judge whether the highest topic relevance score sum reaches the minimum score of a core place name; if it does not, the text data has no core place name; if it does, the city place name with the highest topic relevance score sum is selected as the core place name.

Further, the frequency score in step S3 is calculated by the following formula:

where $S_{if}$ is the frequency score and $f_i$ is the number of occurrences of the i-th place name in the place name set.

Further, the place name importance score in step S4 is calculated as follows:

$S_{it} = S_t$, where $S_t = 1$ if the place name appears in the title of the text data, and $S_t = 0$ otherwise.

Further, the membership score in step S5 is calculated as follows:

$S_{ir} = S_r$, where $S_r = 0.5$ if the superior place name of a place name appears in the place name set, and $S_r = 0$ otherwise.

Further, in step S6, the sentence component of a place name in a sentence of the text data is the subject, adverbial, attributive or object, and the place name component score is calculated as follows:

$S_{ic} = S_{cz} + S_{ch} + S_{cd} + S_{cb}$, where $S_{cz}$, $S_{ch}$, $S_{cd}$ and $S_{cb}$ are the scores of the place name as subject, adverbial, attributive and object respectively; a component score is counted each time the place name occurs.

In another aspect, the core place name extraction device based on natural language processing technology comprises the following units:

Chinese word segmentation dictionary production unit: used to compile the names of the provinces, cities and counties nationwide and their administrative division codes into a one-to-one data table, save the hierarchical membership relationships between the place names, and make the names of the provinces, cities and counties into a Chinese word segmentation dictionary;

Place name set acquisition unit: used to perform Chinese word segmentation on a specified piece of text data with a natural language processing tool according to the Chinese word segmentation dictionary, and obtain a place name set that includes sentence components;

Frequency score calculation unit: used to count the number of occurrences of each identical place name in the place name set and calculate the frequency score of each place name;

Importance score calculation unit: used to judge whether a place name appears in the title of the text data and calculate the importance score of that place name;

Relationship score calculation unit: used to score the hierarchical membership relationship of each place name in the place name set according to the province, city and county membership relationships saved in step S1, and calculate the membership score;

Component score calculation unit: used to obtain the place name component score according to the component that each place name in the place name set serves as in its sentence;

Total score sorting unit: used to add the frequency score, importance score, relationship score and component score to obtain the topic relevance score of each place name, aggregate the place names by city, calculate for each city the sum of the topic relevance scores of the city and all districts and counties under its jurisdiction, and sort the sums from high to low;

Core place name judgment unit: used to judge whether the highest topic relevance score sum reaches the minimum score of a core place name; if it does not, the text data has no core place name; if it does, the city place name with the highest topic relevance score sum is selected as the core place name.

The beneficial effects of the invention are as follows. Based on the characteristics of how place names appear in text, the invention constructs a scoring model and defines a way of describing place names by their frequency of occurrence, position, dependency relation and membership relation. Based on natural language processing technology, it establishes a core place name recognition algorithm that builds membership relationships among the place names extracted from text information and assigns each place name a degree of correlation with the text topic, so that core place names can be identified and extracted from text information. The invention thus proposes a text-oriented core place name extraction method whose algorithm steps are simple, easy to understand and implement, and can be quickly applied in production projects.

Brief description of the drawings

Fig. 1 is a flowchart of the core place name extraction method based on natural language processing technology provided by an embodiment of the invention;

Fig. 2 is a structural diagram of the core place name extraction device based on natural language processing technology provided by an embodiment of the invention.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

The technical solutions of the invention are illustrated below through specific embodiments.

Embodiment one:

Fig. 1 shows the flowchart of the core place name extraction method based on natural language processing technology provided by an embodiment of the invention; for ease of description, only the parts related to the embodiment are shown.

This method can extract core place names at county level and above from a specified piece of text data. A piece of text information may mention multiple place names. Some are closely related to the topic of the text content, while others have little association with it, and these weakly related place names interfere with a computer's understanding of the text information. The purpose of the invention is to extract the core place names closely related to the topic of the text information.

The core place name extraction method based on natural language processing technology provided by the embodiment of the invention comprises the following steps:

Step S1: compile the names of the provinces, cities and counties nationwide and their administrative division codes into a one-to-one data table, save the hierarchical membership relationships between the place names, and make the names of the provinces, cities and counties into a Chinese word segmentation dictionary.

In this step, the names and administrative division codes of the provinces, cities and counties nationwide are based on the administrative divisions and codes currently published by China's national statistics office. The data table stores the hierarchical membership relationships between place names: for example, a province is the superior of a city and contains multiple cities, and a city is the superior of a county (or district) and contains several counties (or districts). In this embodiment the Chinese word segmentation dictionary is made with the HanNLP tool.
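The data table of step S1 can be pictured with a small Python sketch. It is an illustration only: the three sample rows and the names `DIVISIONS`, `SEGMENT_DICT` and `parent_code` are hypothetical, and the sketch assumes the six-digit code layout of the national administrative division code standard (two digits each for province, city and county), which the text does not spell out.

```python
# Hypothetical sample rows: administrative division code -> place name.
# Assumes six-digit codes laid out as province(2) + city(2) + county(2).
DIVISIONS = {
    "420000": "Hubei Province",
    "420100": "Wuhan City",
    "420106": "Wuchang District",
}

def parent_code(code):
    """Return the code of the direct superior, or None for a province."""
    if code[4:] != "00":      # county or district level
        return code[:4] + "00"
    if code[2:4] != "00":     # city level
        return code[:2] + "0000"
    return None               # province level has no superior in this table

# The segmentation dictionary is simply the set of place names in the table.
SEGMENT_DICT = set(DIVISIONS.values())
```

Deriving the superior from the code keeps the table to one row per place name; special cases such as municipalities directly under the central government are not handled in this sketch.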

Step S2: according to the Chinese word segmentation dictionary, use a natural language processing tool to perform Chinese word segmentation on a specified piece of text data, and obtain a place name set that includes sentence components.

In this step the text is segmented with a natural language processing tool (the HanNLP tool in this method). Natural language processing covers the theories and methods that enable effective communication between people and computers in natural language; its main areas include Chinese automatic word segmentation, part-of-speech tagging, text classification, named entity recognition, dependency parsing, speech recognition, information retrieval, machine translation and automatic summarization. The method of the invention uses the part-of-speech tagging, named entity recognition and dependency parsing functions of the natural language processing tool. Let the place name set obtained in this step be M. The title of the text data is also processed by the natural language processing tool, which facilitates the scoring rules designed in the following steps.
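To make the role of the dictionary concrete, the sketch below finds place name mentions with a naive longest-match scan. This is a stand-in technique and not the segmentation performed by the tool named in the text; it also does not produce sentence components, which would require dependency parsing. The function name `find_place_names` and the sample sentence are hypothetical.

```python
def find_place_names(text, dictionary):
    """Scan left to right, taking the longest dictionary entry at each position."""
    max_len = max((len(word) for word in dictionary), default=0)
    found = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                found.append(candidate)   # keep repeats: they feed the frequency count
                i += size
                break
        else:
            i += 1  # no place name starts here; advance one character
    return found

mentions = find_place_names("哈尔滨市松北区举办活动，哈尔滨市很冷", {"哈尔滨市", "松北区"})
```

Repeated mentions are kept in order so that later steps can count occurrences per place name.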

Step S3: count the number of occurrences of each identical place name in the place name set, and calculate the frequency score of each place name.

The frequency score in step S3 is calculated by the following formula:

After a piece of text is segmented, multiple different place names appear in it, and some place names occur more than once. Through segmentation and counting with the natural language processing tool, the distinct place names in place name set M and the number of occurrences of each can be obtained. According to the above formula, once the same place name occurs more than 3 times the score no longer differs noticeably, since that many occurrences already show that the text is strongly related to the place name; the frequency score therefore needs to be bounded.
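The formula itself is not reproduced in this text. As a hedged reconstruction only: the values given in the worked example of this embodiment (0.63 for one occurrence, 0.86 for two) and the saturation after about three occurrences are consistent with $1 - e^{-f_i}$. The sketch below assumes that form; it is an inference from those numbers, not a formula confirmed by the source, and `frequency_score` is a hypothetical name.

```python
import math

def frequency_score(count):
    """Assumed form 1 - exp(-count); inferred from the example values, not confirmed."""
    return 1.0 - math.exp(-count)

one = round(frequency_score(1), 2)   # compared against 0.63 in the worked example
two = round(frequency_score(2), 2)   # compared against 0.86 in the worked example
```

Under this assumption the score stays below 1 however often a place name occurs, which matches the stated need to bound the frequency score.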

Step S4: judge whether a place name appears in the title of the text data, and calculate the importance score of that place name.

The place name importance score in step S4 is calculated as follows:

$S_{it} = S_t$, where $S_t = 1.0$ if the place name appears in the title of the text data, and $S_t = 0$ otherwise.
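As a minimal sketch of this rule, assuming the place names found in the title have already been collected into a set (the names `importance_score` and `title_places` are hypothetical):

```python
def importance_score(place, title_places):
    """1.0 if the place name occurs in the title, otherwise 0.0."""
    return 1.0 if place in title_places else 0.0
```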

Step S5: according to the province, city and county membership relationships saved in step S1, score the hierarchical membership relationship of each place name in the place name set, and calculate the membership score.

Each place name, whether a province, city or county, has a unique administrative division code. Because the coding rule of administrative division codes reflects the superior-subordinate relationship, the method can determine from the codes whether two place names are in a membership relationship.

The membership score in step S5 is calculated as follows:

$S_{ir} = S_r$, where $S_r = 0.5$ if the superior place name of a place name appears in the place name set, and $S_r = 0$ otherwise.

In this embodiment, if Wuhan City and Wuchang District both appear in place name set M, then Wuchang District obtains the membership score, because Wuhan City is its direct superior place name. If only Wuhan City appears in the place name set, its membership score is 0. If Hubei Province and Wuhan City both appear in the place name set, then Wuhan City obtains the membership score.
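The three cases just described can be checked with a short sketch. The mapping `SUPERIOR_OF` is a hypothetical stand-in for the membership relationships saved in step S1, and `membership_score` is an illustrative name.

```python
# Hypothetical excerpt of the saved membership relationships: place -> direct superior.
SUPERIOR_OF = {
    "Wuchang District": "Wuhan City",
    "Wuhan City": "Hubei Province",
}

def membership_score(place, place_set, superior_of=SUPERIOR_OF):
    """0.5 if the direct superior of `place` is also in the place name set, else 0.0."""
    superior = superior_of.get(place)
    return 0.5 if superior is not None and superior in place_set else 0.0
```

Note that the score goes to the subordinate place name, not to the superior one, which follows the Wuhan City and Wuchang District case in the text.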

Step S6: according to the component that each place name in the place name set serves as in its sentence, obtain the place name component score.

In step S2 above, Chinese word segmentation was performed on the text data with the natural language processing tool, and the resulting place name set includes the sentence component of each place name in its sentence. A place name differs in importance depending on the component it serves as. In step S6, the sentence component of a place name in a sentence of the text data is the subject, adverbial, attributive or object, and the place name component score is calculated as follows:

$S_{ic} = S_{cz} + S_{ch} + S_{cd} + S_{cb}$, where $S_{cz}$, $S_{ch}$, $S_{cd}$ and $S_{cb}$ are the scores of the place name as subject, adverbial, attributive and object respectively; a component score is counted each time the place name occurs.

In this embodiment, the component score constants for a place name appearing in the text data as subject, adverbial, attributive and object are 0.5, 0.3, 0.3 and 0.1 respectively. For example, if a place name occurs twice in the text, once as a subject and once as an object, its place name component score is 0.6.
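The component score can be sketched as a lookup summed over every occurrence of a place name. The constants are the ones given in this embodiment; the role labels and the name `component_score` are hypothetical, and roles outside the four listed are assumed to score 0.

```python
# Constants from the embodiment, keyed by illustrative role labels.
COMPONENT_SCORES = {
    "subject": 0.5,
    "adverbial": 0.3,
    "attributive": 0.3,
    "object": 0.1,
}

def component_score(roles):
    """Sum the constant for each occurrence; `roles` has one entry per occurrence."""
    return sum(COMPONENT_SCORES.get(role, 0.0) for role in roles)

subject_and_object = component_score(["subject", "object"])   # the example in the text
```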

Step S7: add the four scores of steps S3 to S6 to obtain the topic relevance score of each place name, aggregate the place names by city, calculate for each city the sum of the topic relevance scores of the city and all districts and counties under its jurisdiction, and sort the sums from high to low.

In this step, the topic relevance score of a place name is $S_i = S_{if} + S_{it} + S_{ir} + S_{ic}$, where $S_i$ indicates the degree of correlation between the place name and the topic of the text data; the higher the value, the higher the correlation.

The place names are then aggregated by city: for each city, the sum of the scores of the city and all districts and counties under its jurisdiction is calculated, and the sums are sorted from high to low. For example, if the place name set contains Wuhan City, Hongshan District and Wuchang District, aggregating by city means counting Wuhan City together with its subordinate Hongshan District and Wuchang District and calculating their total topic relevance score sum.

After aggregation, the invention counts the topic relevance score sum of each city, takes the city place name with the maximum sum, and then judges whether that maximum is greater than the minimum score T. In this step the minimum score of a core place name is set to T = 1.6. If $S_i > T$, the place name is the core place name of the text data; otherwise it is not.
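Aggregation by city and the threshold test can be sketched as follows. The threshold 1.6 is the value given in this embodiment; the function name `pick_core_city`, the mapping `city_of` and the sample scores are hypothetical. The sketch uses a strict "greater than" comparison as in this paragraph, whereas step S8 speaks of "reaching" the minimum score, so the boundary case is an assumption.

```python
from collections import defaultdict

MIN_CORE_SCORE = 1.6  # T in the embodiment

def pick_core_city(scores, city_of):
    """Sum scores per city, rank cities high to low, apply the threshold."""
    totals = defaultdict(float)
    for place, score in scores.items():
        # Districts and counties roll up to their city; other names count as themselves.
        totals[city_of.get(place, place)] += score
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None, ranked
    best_city, best_total = ranked[0]
    core = best_city if best_total > MIN_CORE_SCORE else None
    return core, ranked

# Hypothetical scores, for illustration only.
sample_scores = {"Wuhan City": 1.0, "Hongshan District": 0.5,
                 "Wuchang District": 0.4, "Beijing": 0.9}
sample_city_of = {"Hongshan District": "Wuhan City", "Wuchang District": "Wuhan City"}
core, ranked = pick_core_city(sample_scores, sample_city_of)
```

Returning `None` models the case in which the text has no core place name.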

An example is given below.

In an internet public opinion monitoring system, internet texts need to be classified by region automatically, for example into Beijing, Shanghai, Wuhan, Guangzhou, Shenzhen and so on, so that monitoring personnel can promptly find the public opinion information relevant to their own region.

Assume the following public opinion information:

With traditional place name extraction, the place name words "Harbin", "Beijing", "Shanghai", "Guangzhou", "Wuhan", "Xi'an" and "Chengdu" would all be extracted, and the public opinion monitoring system would classify the information under all of these regions at once.

With the method of the invention, the place name set first contains Harbin, Songbei District, Beijing, Shanghai, Guangzhou, Wuhan, Xi'an and Chengdu. Harbin occurs 2 times, so its frequency score is 0.86; every other place name occurs once, with a frequency score of 0.63. Harbin appears in the title, so its place name importance score is 1, and the importance score of every other place name is 0. Harbin and Songbei District are in a membership relationship, so the membership score of Songbei District is 0.5 and that of every other place name is 0. Harbin occurs twice, both times as an adverbial, and every other place name occurs once, also as an adverbial, so the place name component score of Harbin is 0.6 and that of every other place name is 0.3.

In the final tally, the topic relevance score of Harbin is 2.46, that of Songbei District is 1.43, and that of each other place name is 0.93. After place name aggregation, the topic relevance score sum of each city is counted: the sum for Harbin is 3.89, and the sums for Beijing, Shanghai, Guangzhou, Wuhan, Xi'an and Chengdu are each 0.93. Since 3.89 is greater than 1.6, the public opinion monitoring system classifies the region of this public opinion information as "Harbin". Harbin is the core place name, and the other place names occurring in the text are ignored, which improves the precision of the regional classification of public opinion.
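The arithmetic of this example can be re-checked with a few lines of Python. The four component values per place name are copied from the example; the variable names are hypothetical, and each score is rounded to two decimals before summing, which is an assumption about how the figures were tallied.

```python
# (frequency, importance, membership, component) as stated in the example.
example = {
    "Harbin": (0.86, 1.0, 0.0, 0.6),
    "Songbei District": (0.63, 0.0, 0.5, 0.3),
    "Beijing": (0.63, 0.0, 0.0, 0.3),   # the other cities score the same way
}

# Topic relevance score of each place name: the sum of its four scores.
relevance = {place: round(sum(parts), 2) for place, parts in example.items()}

# Songbei District is aggregated under Harbin.
harbin_total = round(relevance["Harbin"] + relevance["Songbei District"], 2)
is_core = harbin_total > 1.6
```

The totals agree with the figures in the text: 2.46 and 1.43 for the two Harbin place names, 0.93 for each other city, and 3.89 for Harbin after aggregation.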

In summary, whereas prior art schemes merely extract place names from text data, the present invention establishes membership relationships among the place names extracted from text information and assigns each place name a degree of correlation with the text topic, so the extracted place names help a computer understand the text information properly. The steps of the core place name extraction algorithm are simple, easy to understand and implement, and can be quickly applied in production projects.

Embodiment two:

Fig. 2 shows the structural diagram of the core place name extraction device based on natural language processing technology provided by an embodiment of the invention. The device is used to carry out the core place name extraction method based on natural language processing technology; for ease of description, only the parts related to the embodiment are shown.

The core place name extraction device based on natural language processing technology comprises the following units:

Each functional unit provided in this embodiment implements the corresponding one of steps S1 to S8 in embodiment one; the specific implementation is not repeated here.

The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A core place name extraction method based on natural language processing technology, characterized in that the method comprises the following steps:

Step S1: compile the names of the provinces, cities and counties nationwide and their administrative division codes into a one-to-one data table, save the hierarchical membership relationships between the place names, and make the names of the provinces, cities and counties into a Chinese word segmentation dictionary;

Step S2: according to the Chinese word segmentation dictionary, use a natural language processing tool to perform Chinese word segmentation on a specified piece of text data, and obtain a place name set that includes sentence components;

Step S3: count the number of occurrences of each identical place name in the place name set, and calculate the frequency score of each place name;

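Step S4: judge whether a place name appears in the title of the text data, and calculate the importance score of that place name;

Step S5: according to the province, city and county membership relationships saved in step S1, score the hierarchical membership relationship of each place name in the place name set, and calculate the membership score;

Step S6: according to the component that each place name in the place name set serves as in its sentence, obtain the place name component score;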

Step S7: add the four scores of steps S3 to S6 to obtain the topic relevance score of each place name, aggregate the place names by city, calculate for each city the sum of the topic relevance scores of the city and all districts and counties under its jurisdiction, and sort the sums from high to low;

Step S8: judge whether the highest topic relevance score sum reaches the minimum score of a core place name; if it does not, the text data has no core place name; if it does, the city place name with the highest topic relevance score sum is selected as the core place name.

2. The core place name extraction method based on natural language processing technology according to claim 1, characterized in that the frequency score in step S3 is calculated by the following formula:

where $S_{if}$ is the frequency score and $f_i$ is the number of occurrences of the i-th place name in the place name set.

3. The core place name extraction method based on natural language processing technology according to claim 1, characterized in that the place name importance score in step S4 is calculated as follows:

$S_{it} = S_t$, where $S_t = 1$ if the place name appears in the title of the text data, and $S_t = 0$ otherwise.

4. The core place name extraction method based on natural language processing technology according to claim 1, characterized in that the membership score in step S5 is calculated as follows:

$S_{ir} = S_r$, where $S_r = 0.5$ if the superior place name of a place name appears in the place name set, and $S_r = 0$ otherwise.

5. The core place name extraction method based on natural language processing technology according to claim 1, characterized in that, in step S6, the sentence component of a place name in a sentence of the text data is the subject, adverbial, attributive or object, and the place name component score is calculated as follows:

$S_{ic} = S_{cz} + S_{ch} + S_{cd} + S_{cb}$, where $S_{cz}$, $S_{ch}$, $S_{cd}$ and $S_{cb}$ are the scores of the place name as subject, adverbial, attributive and object respectively; a component score is counted each time the place name occurs.

6. A core place name extraction device based on natural language processing technology, characterized in that the device comprises the following units:

Chinese word segmentation dictionary production unit: used to compile the names of the provinces, cities and counties nationwide and their administrative division codes into a one-to-one data table, save the hierarchical membership relationships between the place names, and make the names of the provinces, cities and counties into a Chinese word segmentation dictionary;

Place name set acquisition unit: used to perform Chinese word segmentation on a specified piece of text data with a natural language processing tool according to the Chinese word segmentation dictionary, and obtain a place name set that includes sentence components;

Frequency score calculation unit: used to count the number of occurrences of each identical place name in the place name set and calculate the frequency score of each place name;

Importance score calculation unit: used to judge whether a place name appears in the title of the text data and calculate the importance score of that place name;

Relationship score calculation unit: used to score the hierarchical membership relationship of each place name in the place name set according to the province, city and county membership relationships saved in step S1, and calculate the membership score;

Component score calculation unit: used to obtain the place name component score according to the component that each place name in the place name set serves as in its sentence;

Total score sorting unit: used to add the frequency score, importance score, relationship score and component score to obtain the topic relevance score of each place name, aggregate the place names by city, calculate for each city the sum of the topic relevance scores of the city and all districts and counties under its jurisdiction, and sort the sums from high to low;

Core place name judgment unit: used to judge whether the highest topic relevance score sum reaches the minimum score of a core place name; if it does not, the text data has no core place name; if it does, the city place name with the highest topic relevance score sum is selected as the core place name.