CN110399613B

CN110399613B - Method and system for identifying internet news related to place names based on part-of-speech tagging

Info

Publication number: CN110399613B
Application number: CN201910681163.3A
Authority: CN
Inventors: 苏坤雄; 彭光
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2023-03-31
Anticipated expiration: 2039-07-26
Also published as: CN110399613A

Abstract

The invention discloses a method and a system for identifying the place names involved in internet news based on part-of-speech tagging, and belongs to the technical field of natural language processing. The method uses the overall reporting region of a news media column to supplement the context of a news item, which helps the place name disambiguation program resolve place names correctly. It uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases, performs place name recognition on that sequence, applies two rounds of place name reduction to the recognition results to eliminate inaccurate place names, and finally combines the two reduction results by weighted aggregation to confirm the place names. The method is easy to understand and simple to implement, effectively addresses the low accuracy of extracting the place names involved in news, and has good value for wider application.

Description

Method and system for identifying internet news related to place names based on part-of-speech tagging

Technical Field

The invention relates to the technical field of natural language processing, and in particular to a method and a system for identifying the place names involved in internet news based on part-of-speech tagging.

Background

Place name recognition is a subfield of entity recognition within natural language processing. Traditional place name recognition relies on a place name hierarchy model built from an administrative division dictionary, together with a word segmentation algorithm, place name disambiguation, and related techniques. Typically, a forward maximum matching segmentation algorithm is used to match the longest possible place names. In addition, mainstream place name disambiguation introduces context information through hierarchical clustering to resolve the ambiguity caused by different places sharing the same name.

However, current place name recognition methods have the following problems when identifying the place names involved in internet news:

(1) In practice, place name recognition based on a place name hierarchy model and forward maximum matching segmentation produces inaccurate results. For example, the phrase "improving safety" is matched to the place "Gaoan County", and the phrase "industry and agriculture" is matched to the place "Gongnong District, Hegang City";

(2) Local news reports often use expressions such as "our city" and "our county" without ever naming the place itself. In this case, plain place name recognition cannot determine which place the news involves;

(3) Local news reports sometimes mention only a district-level administrative division, such as "city middle region" or "flat region", without naming any higher-level region. Because many district-level divisions share such a name and no other context is available, the place name disambiguation program cannot determine the correct region.
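The false matches described in item (1) can be reproduced in a few lines. The sketch below is a minimal forward maximum matching lookup, not the matcher used by any particular system. The two-entry gazetteer and the Chinese strings (提高安全 for "improving safety", 工农业 for "industry and agriculture") are assumed reconstructions chosen for illustration.

```python
def forward_max_match(text, gazetteer):
    """Scan left to right, always taking the longest dictionary entry that fits."""
    max_len = max(len(name) for name in gazetteer)
    hits, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in gazetteer:
                hits.append(gazetteer[candidate])
                i += size
                break
        else:
            i += 1  # no dictionary entry starts here; move on one character
    return hits


# Hypothetical gazetteer: short place-name forms mapped to a display name.
GAZETTEER = {"高安": "Gaoan County", "工农": "Gongnong District"}

# Neither phrase is about a place, yet both produce a place-name hit.
false_hits = [forward_max_match(p, GAZETTEER) for p in ("提高安全", "工农业")]
```

Because the matcher has no notion of word boundaries or parts of speech, any character run that happens to equal a dictionary entry is reported as a place.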

Disclosure of Invention

The technical task of the invention is to provide a part-of-speech-tagging-based method for identifying the place names involved in internet news that is easy to understand, simple to implement, and effectively addresses the low accuracy of extracting the place names involved in news.

The invention further provides a part-of-speech-tagging-based system for identifying the place names involved in internet news.

To achieve this purpose, the invention provides the following technical solution:

A method for identifying the place names involved in internet news based on part-of-speech tagging: the method uses the overall reporting region of a news media column to supplement the context of the news, which helps the place name disambiguation program resolve place names correctly; uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases; performs place name recognition on the noun phrase sequence; applies two rounds of place name reduction to the recognition results to eliminate inaccurate place names; and finally combines the two reduction results by weighted aggregation to confirm the place names.

Preferably, the method specifically comprises the following steps:

S1, determining the overall reporting region of a media column: the overall reporting region is the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under that column of the news media;

S2, acquiring a noun phrase sequence: applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;

S3, first place name reduction: reducing the recognition results on the basis of each individual noun phrase in the noun phrase sequence;

S4, second place name reduction: on the basis of the first place name reduction, performing a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reducing the results accordingly;

S5, acquiring the identified place names: adding the first and second place name reduction results to a weight set, and confirming the place names involved after weighted aggregation.

Preferably, in step S1, the overall reporting region of the media column is determined in order to supplement the context of the news for place name disambiguation; the overall reporting region is joined, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

If the place names that appear are fairly evenly distributed, the corresponding upper-level place name is selected as the overall reporting region of the column.
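Step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not quantify a dominant share, so the `dominance` threshold of 0.5 is an assumption, as are the `parent_of` lookup table, the choice of the most frequent parent as the "corresponding upper-level place name", and the decision to put the region in front of the title and text.

```python
from collections import Counter


def overall_reporting_region(report_places, parent_of, dominance=0.5):
    """Pick the column's overall reporting region from the places its reports point to."""
    if not report_places:
        raise ValueError("at least one report place is required")
    counts = Counter(report_places)
    place, top = counts.most_common(1)[0]
    if top / len(report_places) > dominance:  # assumed meaning of "dominant share"
        return place
    # Balanced distribution: fall back to the most frequent upper-level place.
    parents = Counter(parent_of[p] for p in report_places)
    return parents.most_common(1)[0][0]


def build_inputs(region, title, body):
    """Join the region, a space, and the title/body (region-first order is assumed)."""
    return f"{region} {title}", f"{region} {body}"


# Hypothetical hierarchy used only for this example.
PARENT_OF = {"Gangcheng District": "Laiwu City", "Laicheng District": "Laiwu City"}
```

For example, a column whose reports point to Gangcheng District eight times out of ten gets "Gangcheng District" as its region, while an even split between the two districts falls back to "Laiwu City".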

Preferably, in step S2, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text to form the noun phrase sequence.

Preferably, in step S3, place name recognition is performed on the noun phrase sequences of the title input and the text input, and, within each noun phrase, any later-recognized place name that has no upper-lower level (hierarchical) relationship with the place names already recognized is removed.

Preferably, in step S4, a second word segmentation is performed on each place name recognition result and on its corresponding noun phrase. If every word in the segmented place name recognition result appears, consecutively, in the segmentation of the corresponding noun phrase, the recognition is considered accurate; otherwise the recognition result is removed.

A system for identifying the place names involved in internet news based on part-of-speech tagging comprises the following modules:

a media column overall reporting region determination module, used for determining the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under the column of the news media;

a noun phrase sequence acquisition module, used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;

a first place name reduction module, used for reducing the recognition results on the basis of each individual noun phrase in the noun phrase sequence;

a second place name reduction module, used for performing, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reducing the results accordingly;

and an identified place name acquisition module, used for adding the first and second place name reduction results to a weight set and confirming the place names involved after weighted aggregation.

Preferably, the media column overall reporting region determination module supplements the context of the news for place name disambiguation, and joins the overall reporting region of the media column, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

Preferably, in the noun phrase sequence acquisition module, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text to form the noun phrase sequence.

Preferably, the first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input, and removes, within each noun phrase, any later-recognized place name that has no upper-lower level relationship; and the second place name reduction module, on the basis of the first place name reduction, performs a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reduces the results accordingly.

Compared with the prior art, the part-of-speech-tagging-based method for identifying the place names involved in internet news has the following outstanding advantages: it is easy to understand and simple to implement; by using the overall reporting region of the news media column and reducing the place name recognition results twice, it effectively addresses the low accuracy of extracting the place names involved in news; in actual use its accuracy can reach more than 90%; and it has good value for wider application.

Drawings

Fig. 1 is a flowchart of the part-of-speech-tagging-based method for identifying the place names involved in internet news according to the present invention.

Detailed Description

The part-of-speech-tagging-based method and system for identifying the place names involved in internet news are described in further detail below with reference to the accompanying drawing and the embodiments.

Examples

As shown in Fig. 1, the part-of-speech-tagging-based method for identifying the place names involved in internet news uses the overall reporting region of a news media column to supplement the context of the news, which helps the place name disambiguation program resolve place names correctly; uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases; performs place name recognition on the noun phrase sequence; applies two rounds of place name reduction to the recognition results to eliminate inaccurate place names; and finally combines the two reduction results by weighted aggregation to confirm the place names.

The method specifically comprises the following steps:

S1, determining the overall reporting region of a media column: the overall reporting region is the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under that column of the news media.

If the place names that appear are fairly evenly distributed, the corresponding upper-level place name is selected as the overall reporting region of the column. The overall reporting region supplements the context of the news for place name disambiguation: the overall reporting region of the media column is joined, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

S2, acquiring a noun phrase sequence: part-of-speech tagging is applied to the incoming title input and text input to obtain a noun phrase sequence.

To form the noun phrase sequence, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text.

For example, take the text "The People's Procuratorate of Gangcheng District, Laiwu City carried out judicial relief work, issued 30,000 yuan in judicial relief funds to criminal victims, protected the lawful rights and interests of criminal victims, and ensured that criminal proceedings could go ahead smoothly." Its noun phrase sequence is: "People's Procuratorate of Gangcheng District, Laiwu City", "judicial relief work", "criminal victims", "judicial relief funds", "lawful rights and interests of criminal victims", "criminal proceedings".
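The replacement rule of step S2 can be sketched on the opening of the example sentence above. The function takes tokens that have already been tagged; the Chinese tokens, their tags, and the convention that noun tags start with "n" are assumptions made for illustration, and any real tagger would supply its own tag set. Collapsing runs of spaces into one is a design choice of this sketch.

```python
def noun_phrase_sequence(tagged_tokens):
    """Replace every non-noun token (including punctuation) with a space.

    tagged_tokens: iterable of (word, tag) pairs. Adjacent nouns are
    concatenated, so each space-separated chunk is one noun phrase.
    """
    pieces = [word if tag.startswith("n") else " " for word, tag in tagged_tokens]
    return " ".join("".join(pieces).split())


# Hypothetical tagging of the first two clauses of the example sentence.
TAGGED = [
    ("莱芜市", "ns"), ("钢城区", "ns"), ("人民检察院", "nt"),
    ("开展", "v"), ("司法救助", "n"), ("工作", "n"), ("，", "x"),
    ("为", "p"), ("刑事", "n"), ("被害人", "n"),
    ("发放", "v"), ("司法救助金", "n"),
]

sequence = noun_phrase_sequence(TAGGED)
```

Here `sequence` holds four space-separated noun phrases; the verbs, the preposition, and the comma have all become phrase boundaries.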

S3, first place name reduction: the recognition results are reduced on the basis of each individual noun phrase in the noun phrase sequence.

Place name recognition is performed on the noun phrase sequences of the title input and the text input, and, within each noun phrase, any later-recognized place name that has no upper-lower level (hierarchical) relationship with the place names already recognized is removed. For example, place name recognition on the noun phrase "southern part of Longyan" returns "Longyan City" and "Nanbu County", the latter because the county's name coincides with the word for "southern part". Since "Nanbu County" and "Longyan City" have no upper-lower level relationship, the "Nanbu County" result is removed during this reduction.
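The Longyan example can be sketched as follows. This is one possible reading of the first reduction, under stated assumptions: the first place recognized in a noun phrase is always kept, and a later one is kept only if it is an ancestor or descendant of a place already kept. The `parent_of` table is a hypothetical stand-in for an administrative division dictionary.

```python
def ancestors(place, parent_of):
    """All upper-level places of `place`, following parent links upward."""
    seen = set()
    while place in parent_of and parent_of[place] not in seen:
        place = parent_of[place]
        seen.add(place)
    return seen


def is_related(a, b, parent_of):
    """True if one place is above the other in the hierarchy."""
    return a in ancestors(b, parent_of) or b in ancestors(a, parent_of)


def first_reduction(recognized, parent_of):
    """Drop later-recognized places with no hierarchical link to a kept place."""
    kept = []
    for place in recognized:
        if not kept or any(is_related(place, k, parent_of) for k in kept):
            kept.append(place)
    return kept


# Hypothetical hierarchy fragment for the example.
PARENT_OF = {
    "Longyan City": "Fujian Province",
    "Xinluo District": "Longyan City",
    "Nanbu County": "Nanchong City",
    "Nanchong City": "Sichuan Province",
}
```

With this table, the recognition results "Longyan City" and "Nanbu County" reduce to "Longyan City" alone, whereas "Longyan City" followed by "Xinluo District" keeps both.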

S4, second place name reduction: on the basis of the first place name reduction, a second word segmentation is performed on each place name recognition result and on its corresponding noun phrase, and the results are reduced accordingly. If every word in the segmented place name recognition result appears, consecutively, in the segmentation of the corresponding noun phrase, the recognition is considered accurate; otherwise the recognition result is removed.

For example, the place name recognition result "Gongnong District, Hegang City" is segmented into "Hegang City" and "Gongnong District". The noun phrase that produced this result is "total industrial and agricultural output value of Hegang City", which is segmented into "Hegang City", "industry and agriculture", "production", and "total value". The words of the segmented place name do not all appear consecutively in the segmentation of the noun phrase, so this place name recognition result is eliminated.
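The consecutive-word check of step S4 amounts to testing whether one token list is a contiguous sub-list of another. The sketch below assumes that reading. The `SEGMENTS` table is a hypothetical stand-in for a real word segmenter, and the strings 鹤岗市工农区 (Gongnong District, Hegang City) and 鹤岗市工农业生产总值 (total industrial and agricultural output value of Hegang City) are assumed reconstructions of the example above.

```python
def contains_contiguous(tokens, sub):
    """True if `sub` occurs as a consecutive run inside `tokens`."""
    n = len(sub)
    return n > 0 and any(tokens[i:i + n] == sub for i in range(len(tokens) - n + 1))


def second_reduction(candidates, segment):
    """Keep a place name only if its words appear consecutively in its noun phrase.

    candidates: list of (place_name, noun_phrase) pairs.
    segment: callable returning the word list for a string.
    """
    return [
        place
        for place, phrase in candidates
        if contains_contiguous(segment(phrase), segment(place))
    ]


# Hypothetical pre-computed segmentations standing in for a real segmenter.
SEGMENTS = {
    "鹤岗市工农区": ["鹤岗市", "工农区"],
    "鹤岗市工农业生产总值": ["鹤岗市", "工农业", "生产", "总值"],
    "莱芜市钢城区": ["莱芜市", "钢城区"],
    "莱芜市钢城区人民检察院": ["莱芜市", "钢城区", "人民检察院"],
}

kept = second_reduction(
    [("鹤岗市工农区", "鹤岗市工农业生产总值"), ("莱芜市钢城区", "莱芜市钢城区人民检察院")],
    SEGMENTS.__getitem__,
)
```

The Hegang candidate is dropped because 工农区 is not a word of the segmented phrase, while the Laiwu candidate survives because both of its words appear side by side.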

S5, acquiring the identified place names: in practice, the second reduction sometimes removes too much, so the results of both the first and the second place name reduction are added to a weight set. Because names of administrative divisions at the province, city, and district/county levels are rarely duplicated, the weight of such a division is increased by 1 each time it appears in the two reduction results; for an administrative division at the township, town, or village level, the weight is increased by 0.6 each time it appears. Finally, all administrative divisions that have the maximum weight in the weight set are output.
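The weighting rule can be sketched directly from the increments given above (1 for province, city, and district/county levels; 0.6 for township, town, and village levels). The `level_of` lookup and its level labels are hypothetical; a real system would read the level from its administrative division dictionary.

```python
from collections import defaultdict

# Levels whose names are rarely duplicated and therefore earn the full weight.
HIGH_LEVELS = {"province", "city", "county"}


def confirm_place_names(first_result, second_result, level_of):
    """Aggregate both reduction results and return every place with the top weight."""
    weights = defaultdict(float)
    for place in list(first_result) + list(second_result):
        weights[place] += 1.0 if level_of[place] in HIGH_LEVELS else 0.6
    if not weights:
        return []
    best = max(weights.values())
    return [place for place, weight in weights.items() if weight == best]


# Hypothetical level table for the example.
LEVEL_OF = {
    "Gangcheng District": "county",
    "Gongnong District": "county",
    "Some Town": "town",
}

confirmed = confirm_place_names(
    ["Gangcheng District", "Gongnong District", "Some Town"],
    ["Gangcheng District", "Some Town"],
    LEVEL_OF,
)
```

In this example "Gangcheng District" scores 2.0, "Some Town" scores 1.2, and "Gongnong District" scores 1.0, so only "Gangcheng District" is output. A place that survives only the first reduction still contributes, which is how the scheme compensates for over-reduction in the second pass.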

The system of the invention for identifying the place names involved in internet news based on part-of-speech tagging comprises the following modules:

A media column overall reporting region determination module, used for determining the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under the column of the news media.

The media column overall reporting region determination module supplements the context of the news for place name disambiguation, and joins the overall reporting region of the media column, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

A noun phrase sequence acquisition module, used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence.

In the noun phrase sequence acquisition module, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text to form the noun phrase sequence.

A first place name reduction module, used for reducing the recognition results on the basis of each individual noun phrase in the noun phrase sequence.

The first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input, and removes, within each noun phrase, any later-recognized place name that has no upper-lower level relationship.

A second place name reduction module, used for performing, on the basis of the first place name reduction, a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reducing the results accordingly.

The above-described embodiments are merely preferred embodiments of the present invention; ordinary changes and substitutions made by those skilled in the art within the scope of the technical solution of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for identifying the place names involved in internet news based on part-of-speech tagging, characterized in that: the method uses the overall reporting region of a news media column to supplement the context of the news, which helps the place name disambiguation program resolve place names correctly; uses part-of-speech tagging to convert the news content into a sequence of pure noun phrases; performs place name recognition on the noun phrase sequence; applies two rounds of place name reduction to the recognition results to eliminate inaccurate place names; and finally combines the two reduction results by weighted aggregation to confirm the place names; the method specifically comprises the following steps:

S1, determining the overall reporting region of a media column: the overall reporting region is the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under that column of the news media;

S2, acquiring a noun phrase sequence: applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;

S3, first place name reduction: reducing the recognition results on the basis of each individual noun phrase in the noun phrase sequence, by performing place name recognition on the noun phrase sequences of the title input and the text input and removing, within each noun phrase, any later-recognized place name that has no upper-lower level relationship;

S4, second place name reduction: on the basis of the first place name reduction, performing a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reducing the results accordingly;

2. The method for identifying the place names involved in internet news based on part-of-speech tagging according to claim 1, characterized in that: in step S1, the overall reporting region of the media column is determined in order to supplement the context of the news for place name disambiguation, and the overall reporting region is joined, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

3. The method for identifying the place names involved in internet news based on part-of-speech tagging according to claim 2, characterized in that: in step S2, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text to form the noun phrase sequence.

4. A system for identifying the place names involved in internet news based on part-of-speech tagging, characterized in that the system comprises the following modules:

a media column overall reporting region determination module, used for determining the place name that accounts for a dominant share of the occurrences of place names referred to by all reports under the column of the news media; a noun phrase sequence acquisition module, used for applying part-of-speech tagging to the incoming title input and text input to obtain a noun phrase sequence;

an identified place name acquisition module, used for adding the first and second place name reduction results to a weight set and confirming the place names involved after weighted aggregation; wherein the first place name reduction module performs place name recognition on the noun phrase sequences of the title input and the text input, and removes, within each noun phrase, place names that have no upper-lower level relationship; and the second place name reduction module, on the basis of the first place name reduction, performs a second word segmentation on each place name recognition result and on its corresponding noun phrase, and reduces the results accordingly.

5. The system for identifying the place names involved in internet news based on part-of-speech tagging according to claim 4, characterized in that: the media column overall reporting region determination module supplements the context of the news for place name disambiguation, and joins the overall reporting region of the media column, with a space, to the news title and to the news text respectively to obtain the title input and the text input.

6. The system for identifying the place names involved in internet news based on part-of-speech tagging according to claim 5, characterized in that: in the noun phrase sequence acquisition module, punctuation marks and all words whose part of speech is not a noun are replaced with spaces in the original text to form the noun phrase sequence.