CN111753515A

CN111753515A - Address information extraction and matching method for realizing entity positioning

Info

Publication number: CN111753515A
Application number: CN202010590590.3A
Authority: CN
Inventors: 曾伟英; 霍智杰; 霍凯亮
Original assignee: Guangdong Kejie Communication Information Technology Co ltd
Current assignee: Guangdong Kejie Communication Information Technology Co ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2020-10-09
Anticipated expiration: 2040-06-24

Abstract

A method for extracting and matching address information to realize entity positioning comprises: constructing a first conditional random field and determining its states according to administrative level keywords; performing state jumps on the address text according to the states of the first conditional random field; dividing the address text into a plurality of sub-texts according to the state jumps; adding text labels to the divided sub-texts at the corresponding administrative levels; constructing and storing a label library according to the text labels; constructing a second conditional random field, comprising adding a new text label to the second address text according to the text labels; and performing fuzzy matching on the address text, comprising acquiring a weight value of each text label in the label library and, according to the weight values, matching the address text in the label library whose text labels are most similar to the input address text. The method solves the problem that data cannot be directly associated because two different sources of original data follow different writing habits when address information is entered.

Description

Address information extraction and matching method for realizing entity positioning

Technical Field

The invention relates to the technical field of text matching, in particular to an address information extraction and matching method for realizing entity positioning.

Background

Geographic information is currently the most common public information resource; it is closely tied to people's daily lives and is also a basic resource for government administration. A text address is a geographical location described in words, such as "Beiyuan Road, Chaoyang District, Beijing". However, data mining on data sets that contain address text often runs into the problem that most address information in the original data is recorded irregularly, which creates a bottleneck when correlation analysis is performed on massive address texts.

Disclosure of Invention

The invention aims to address the deficiencies described in the background by providing an address information extraction and matching method for realizing entity positioning. The method rapidly extracts labels from massive address texts in a way that conforms to conventional understanding, makes it easy to associate data that must be linked by address, and solves the problem that data cannot be directly associated because two different sources of original data follow different writing habits when address information is entered.

In order to achieve the purpose, the invention adopts the following technical scheme:

an address information extraction and matching method for realizing entity positioning, applied to a first address text containing administrative level keywords and a second address text not containing the administrative level keywords, specifically comprises the following steps:

constructing a first conditional random field applicable to the first address text, comprising:

determining the states of the first conditional random field according to the administrative level keywords;

performing state jumps on the address text according to the states of the first conditional random field;

dividing the address text into a plurality of sub-texts according to the state jumps;

for the successfully divided address texts, adding text labels to the divided sub-texts at the corresponding administrative levels;

constructing and storing a label library according to the text labels;

constructing a second conditional random field applicable to the second address text, comprising:

adding a new text label to the second address text according to the text labels in the label library;

performing fuzzy matching on the address text, comprising:

acquiring a weight value of each text label in the label library;

and matching, according to the weight values, the address text in the label library whose text labels are most similar to the input address text.

Preferably, the first address texts are graded according to administrative level keywords of the first address texts, and one address text of each grade corresponds to one state of the first conditional random field;

the address texts at the same level are arranged side by side, and the address texts at the lower level are arranged behind the address texts at the higher level.

Preferably, in the state jump process, the high-level state corresponding to the high-level address text jumps to the low-level state corresponding to the low-level address text, and the jump is irreversible;

when the high-level state jumps to the low-level state, all the low-level states of the column in which the low-level state is located are passed;

the states of a single lowest level may jump to each other.

Preferably, in each jump, the address text is divided by using the administrative level keywords corresponding to the level state, and the divided address text enters the next low-level state to be divided again;

the path with the most jumps is selected and determined to be the optimal segmentation path; level states that the address text skips over are not counted in the number of jumps.

Preferably, in the successfully divided address text, the sub-text and the administrative level vocabulary corresponding to each level state are used as text labels.

Preferably, a dictionary is established, the text labels are added to the dictionary according to preset rules, and the dictionary is stored as a two-dimensional data table.

Preferably, the second address text is split character by character; each split character is combined with the character that follows it, and the combination is matched against the label library to judge whether a text label for the combination exists; if so, the combination is retained; if not, it is not retained;

after a combination is retained, it is combined with the next character to form a new combination, which is matched against the label library to judge whether a text label for the new combination exists; if so, the new combination is retained and is in turn combined with the next character; if not, the new combination is not retained;

and so on until none of the split characters can be combined any further.

Preferably, after the input address text is segmented, weight statistics is performed on all text labels, each text label corresponds to a weight value, and the weight value is in direct proportion to the importance of the text label.

Preferably, the similarity between each text label in the input address text and the text label in the label library is calculated, the similarity and the weight value are weighted and averaged, and the text label in the label library with the maximum value is most similar to the input address text.

Advantageous effects:

the invention rapidly extracts labels from massive address texts in a way that conforms to conventional understanding, makes it easy to associate data that must be linked by address, and solves the problem that data cannot be directly associated because two different sources of original data follow different writing habits when address information is entered.

Drawings

FIG. 1 is a block diagram of a model of one embodiment of the present invention;

FIG. 2 is a flow chart of one embodiment of the present invention;

FIG. 3 is a conditional random field state transition diagram of one embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

In the description of the present invention, it is to be understood that the terms "upper", "lower", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention.

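The invention relates to an address information extraction and matching method for realizing entity positioning, applied to a first address text containing administrative level keywords and a second address text not containing the administrative level keywords. As shown in FIGS. 1 and 2, the method comprises the following specific steps:

constructing a first conditional random field applicable to the first address text, comprising: determining the states of the first conditional random field according to the administrative level keywords; performing state jumps on the address text according to those states; dividing the address text into a plurality of sub-texts according to the state jumps; and, for the successfully divided address texts, adding text labels to the divided sub-texts at the corresponding administrative levels;

constructing and storing a label library according to the text labels;

constructing a second conditional random field applicable to the second address text, comprising: adding a new text label to the second address text according to the text labels in the label library;

performing fuzzy matching on the address text, comprising: acquiring a weight value of each text label in the label library; and matching, according to the weight values, the address text in the label library whose text labels are most similar to the input address text.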

The method is mainly divided into two parts: the first part constructs conditional random fields for segmenting and extracting information, and the second part constructs a label library from the extracted text. The conditional random fields are further divided into a first conditional random field and a second conditional random field, which segment two different types of text: a first address text containing administrative level keywords and a second address text not containing them. An example of the first address text is "Huanhu Primary School, Lüjing 3rd Road, Chancheng District, Foshan City, Guangdong Province"; an example of the second address text is "Guangdong Foshan Lüjing 3rd Road Huanhu Primary School". The first conditional random field is constructed first and used to segment a large volume of texts; the label library is then built, and the second conditional random field is constructed on the basis of the label library. When a new input text arrives, the two conditional random fields are used for segmentation and extraction, fuzzy matching is performed on the text based on the Levenshtein distance algorithm, and an approximate text is returned, that is, a text a human reader would recognize as similar. This solves a problem common in everyday data mining work: addresses entered in different data sources are written inconsistently, so address texts cannot be associated even though they refer to the same address semantically.

As shown in FIG. 3, the states of the first conditional random field are determined as follows. In China, all addresses can be divided according to administrative levels, such as autonomous regions, special administrative regions, provinces, cities, districts, counties, towns and subdistrict offices, and are classified by administrative level keywords. For example, "district" and "county" are at the same level and are therefore arranged side by side, while "subdistrict office" and "town" are at the same level as each other but lower than "district" and "county", and are therefore arranged side by side after them. The other levels are arranged in the same way, and together they form the states of the first conditional random field. In actual use, states corresponding to administrative levels can be added or removed as required, and every first address text to be extracted passes through the first conditional random field.

State jumps of the first conditional random field: a state jump must proceed from front to back, that is, from a higher-level address text to a lower-level one, and a high-to-low jump must pass through every possible state. For example, the column containing "district" can only jump to the column containing "subdistrict office", so "district" must pass through "subdistrict office" and "town"; "subdistrict office" in turn can only pass through "residents' committee" and "village committee" when it jumps. A loop exists only at the lowest-level state, "number".
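
The jump rules above can be pictured as a small state layout. The following is a minimal sketch, not the patented implementation: the column list `LEVEL_COLUMNS` and the helper names `column_of` and `can_jump` are hypothetical, and the set of levels is only an assumed subset of those named in the text.

```python
# Hypothetical layout: each inner list is one column of same-level states,
# ordered from the highest administrative level to the lowest.
LEVEL_COLUMNS = [
    ["province", "autonomous region", "special administrative region"],
    ["city"],
    ["district", "county"],
    ["subdistrict office", "town"],
    ["residents' committee", "village committee"],
    ["number"],
]


def column_of(state):
    """Return the index of the column that contains the given state."""
    for index, column in enumerate(LEVEL_COLUMNS):
        if state in column:
            return index
    raise KeyError(state)


def can_jump(src, dst):
    """Allow a jump only to the next column, or a loop inside the lowest column."""
    src_col, dst_col = column_of(src), column_of(dst)
    lowest = len(LEVEL_COLUMNS) - 1
    if src_col == lowest and dst_col == lowest:
        return True  # only the lowest-level states may jump to each other
    return dst_col == src_col + 1  # jumps are one-way, high level to low level
```

Under these assumptions `can_jump("district", "town")` holds while `can_jump("town", "district")` does not, which mirrors the irreversibility of the jumps.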

In the method, each administrative level is regarded as a state, and a jump can only go from a higher level to a lower level and cannot be reversed. Each jump divides the address text using the administrative level keyword corresponding to the state, and each divided sub-text enters a new state. If a division fails, for example because the text goes directly from "city" to "subdistrict office" without giving the "district", the skipped level state is not included in the count. After segmentation by the first conditional random field, the path with the largest number of jumps is selected as the optimal segmentation path.

For example, the address text "Huanhu Primary School, Lüjing 3rd Road, Chancheng District, Foshan City, Guangdong Province" passes not only through the state-jump path "province", "city", "district", "road", but also through paths such as "autonomous region", "city", "county" and so on. At the "autonomous region" state, however, the text cannot be divided, so that state is passed over and the text enters the "city" state directly. The path through "autonomous region" therefore necessarily has fewer successful divisions than the path through "province", and so cannot be the optimal path.
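
The path selection described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the patented implementation: the keyword columns, the function names `split_along_path` and `best_segmentation`, and the Chinese rendering of the example address are all assumed, and every path through the columns is simply enumerated.

```python
from itertools import product

# Hypothetical columns of same-level keywords, highest level first.
LEVEL_COLUMNS = [["省", "自治区"], ["市"], ["区", "县"], ["路"]]


def split_along_path(text, path):
    """Walk one state path; return (labels, remainder, successful_jumps)."""
    labels, remaining = [], text
    for keyword in path:
        idx = remaining.find(keyword)
        if idx <= 0:
            # Division failed: this level state is skipped and not counted.
            continue
        labels.append((remaining[:idx], keyword))
        remaining = remaining[idx + len(keyword):]
    return labels, remaining, len(labels)


def best_segmentation(text):
    """Try every path through the columns and keep the one with the most jumps."""
    results = (split_along_path(text, path) for path in product(*LEVEL_COLUMNS))
    return max(results, key=lambda result: result[2])


# Assumed Chinese form of the example address used in the description.
labels, remainder, jumps = best_segmentation("广东省佛山市禅城区绿景三路环湖小学")
```

With this input the "province" path divides the text four times, while the paths through "自治区" (autonomous region) or "县" (county) divide it fewer times, so the "province" path is kept.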

For a successfully segmented address text, text labels can be added to the segmented sub-texts at the corresponding administrative levels. For the address text "Huanhu Primary School, Lüjing 3rd Road, Chancheng District, Foshan City, Guangdong Province", for example, the "province" level yields "Guangdong" and the "city" level yields "Foshan".

Building the label library: an empty dictionary (a data structure) is created, and the extracted label texts are added as key-value pairs. For example, for "Huanhu Primary School, Lüjing 3rd Road, Chancheng District, Foshan City, Guangdong Province", the segmented sub-text "Guangdong" corresponds to {"text": "Guangdong", "level": "province", "count": "+1"}, where "+1" means the count is incremented on top of its existing value; the remaining labels are handled in the same way.

Expanding the label library: when a label text "Guangdong" appears again, the count stored under the "Guangdong" key is incremented by one; if a label text has not appeared before, it is added as a new key-value pair.

Storing the label library: the updated dictionary is periodically stored as a two-dimensional data table. A two-dimensional table gives efficient access to its data and prepares for the subsequent construction of the second conditional random field.
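
The three label library operations (building, expanding, storing) can be sketched as below. The helper names `add_label` and `to_table`, the use of a (text, level) tuple as the dictionary key, and CSV as the two-dimensional table format are assumptions made for illustration; the source does not specify a storage format.

```python
import csv
import io


def add_label(library, text, level):
    """Insert a new label or increment the count of an existing one."""
    entry = library.setdefault((text, level),
                               {"text": text, "level": level, "count": 0})
    entry["count"] += 1


def to_table(library):
    """Serialize the dictionary as a two-dimensional table (CSV text here)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "level", "count"],
                            lineterminator="\n")
    writer.writeheader()
    for entry in library.values():
        writer.writerow(entry)
    return buffer.getvalue()


library = {}
add_label(library, "广东", "省")  # first occurrence creates the key-value pair
add_label(library, "佛山", "市")
add_label(library, "广东", "省")  # a repeat occurrence only increments the count
table = to_table(library)
```

In practice the table would be written to disk or a database at intervals; an in-memory buffer keeps the sketch self-contained.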

The second conditional random field handles text that contains no administrative level keywords, such as "Nanhai District, Foshan City" written as "Foshan Nanhai". The states of the second conditional random field therefore depend on the labels extracted by the first conditional random field, that is, on the label library. Such an address text is split character by character, each character is then combined with the characters that follow it to form a new sub-text, and each sub-text enters a new state. The probability of each state jump is counted from the label library, and the most probable path is taken as the optimal segmentation. Finally, the administrative label corresponding to each state on the optimal path is looked up.

Splitting character by character: for example, "Guangdong Foshan Lüjing 3rd Road Huanhu Primary School" is split into "Guang" (广), "dong" (东), "Fo" (佛), and so on.

Initializing the combined address text: the split characters are recombined. "Guang" and "dong" are merged first; if "Guangdong" exists in the label library, the combination is retained and merging continues to form "Guangdong Fo". "Guangdong Fo" does not exist in the label library, so that combination is skipped, and merging continues with the following characters until the whole address text has been traversed. The retained combinations form the states of the second conditional random field.
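
A minimal sketch of this character-by-character combination step is given below. The function name `candidate_combinations`, the sample label library contents and the Chinese form of the address are assumptions; the extension rule (a longer combination is only tried while the shorter one was retained) follows the description above.

```python
def candidate_combinations(text, tag_library):
    """Collect (start, combination) pairs that exist in the label library.

    From each start position the current character is merged with the next
    one; a combination is extended only while it is found in the library.
    """
    candidates = []
    for start in range(len(text)):
        end = start + 2
        while end <= len(text) and text[start:end] in tag_library:
            candidates.append((start, text[start:end]))
            end += 1
    return candidates


# Hypothetical library contents; a real library would come from the first
# conditional random field.
tag_library = {"广东", "佛山", "绿景", "绿景三"}
candidates = candidate_combinations("广东佛山绿景三路环湖小学", tag_library)
```

Here "广东" is retained, "广东佛" is rejected, and merging resumes from the following characters, as in the example in the text.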

Selecting the best division: among all candidate combinations, the one that maximizes the product of the root of its occurrence count in the label library and its string length is retained. For example, "Guangdong Fo" occurs 10,000 times while "Guangdong" occurs 1,000,000 times, so the best division is still "Guangdong".
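
The selection rule can be checked with a few lines of code. This sketch assumes that "root" means the square root and that string length is counted in characters; the names `division_score` and `best_division` are hypothetical.

```python
import math


def division_score(combination, occurrence_count):
    """Score = square root of the occurrence count times the string length."""
    return math.sqrt(occurrence_count) * len(combination)


def best_division(counts):
    """Return the candidate combination with the highest score."""
    return max(counts, key=lambda combination: division_score(combination, counts[combination]))


# Occurrence counts taken from the example in the text.
counts = {"广东佛": 10_000, "广东": 1_000_000}
best = best_division(counts)
```

Under these assumptions "广东佛" scores 100 × 3 = 300 and "广东" scores 1000 × 2 = 2000, so the shorter but far more frequent combination is kept.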

Word formation and division iteration: once "Guangdong" is the best combination, any subsequent text containing the characters "Guangdong" is by default divided with "Guangdong" as the best granularity. For "Guangdong Foshan Lüjing 3rd Road Huanhu Primary School", after "Guangdong" has been divided off, best-division selection continues on the characters after "Guangdong" until no further combination is possible; the remaining characters form new combinations, which are added to the label library.

Tagging the new divisions: if the text is divided into "Guangdong" and the corresponding label is "province", then "Guangdong Province" becomes one of the labels of the text "Guangdong Foshan Lüjing 3rd Road Huanhu Primary School", and the remaining divisions are handled in the same way.
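
The iteration and tagging steps can be combined into one sketch. It is an illustration under assumptions, not the patented implementation: the function name `segment_and_tag`, the library layout, the counts, the shortened sample address and the square-root scoring are all assumed, and leftover characters are simply grouped into one new combination.

```python
import math


def segment_and_tag(text, library):
    """library maps label text -> {"level": str or None, "count": int}."""
    def score(option):
        return math.sqrt(library[option]["count"]) * len(option)

    segments, leftover, pos = [], "", 0
    while pos < len(text):
        # All combinations starting at pos that exist in the label library.
        options = [text[pos:end] for end in range(pos + 2, len(text) + 1)
                   if text[pos:end] in library]
        if options:
            if leftover:
                segments.append(leftover)
                leftover = ""
            best = max(options, key=score)
            segments.append(best)
            pos += len(best)
        else:
            leftover += text[pos]  # no combination possible from this character
            pos += 1
    if leftover:
        segments.append(leftover)

    tags = []
    for segment in segments:
        entry = library.get(segment)
        if entry is None:
            # Remaining characters form a new combination added to the library.
            library[segment] = {"level": None, "count": 1}
        elif entry["level"]:
            tags.append(segment + entry["level"])
    return segments, tags


library = {
    "广东": {"level": "省", "count": 1_000_000},
    "佛山": {"level": "市", "count": 500_000},
    "禅城": {"level": "区", "count": 200_000},
}
segments, tags = segment_and_tag("广东佛山禅城环湖小学", library)
```

The tags restore the administrative keywords the input left out, so the second address text ends up with the same kind of labels as a first address text.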

As mentioned above, the text is fuzzy-matched based on the Levenshtein distance algorithm and an approximate text is returned, that is, a text a human reader would recognize as similar. The method uses Levenshtein distance for fuzzy matching of addresses. First, weights are computed for the extracted text labels in the label library; the weighting scheme used here is TF-IDF, which is prior art and is not described further. For example, after the conditional random fields and the TF-IDF calculation are applied to the address text "No. 18 Guilan Road, Nanhai, Foshan", the weight of "Foshan" is 0.15, the weight of "Nanhai" ...
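
The source treats TF-IDF as prior art and gives no formula, so the following is only one common variant, shown to make the weighting step concrete. The smoothed inverse document frequency, the function name `tfidf_weights` and the three sample label lists are assumptions and will not reproduce the 0.15 figure quoted above.

```python
import math
from collections import Counter


def tfidf_weights(address_labels, corpus):
    """Weight each label of one address against a corpus of label lists.

    Uses tf = relative frequency within the address and a smoothed
    idf = ln((1 + N) / (1 + df)) + 1, one of several common TF-IDF variants.
    """
    term_counts = Counter(address_labels)
    n_docs = len(corpus)
    weights = {}
    for label, freq in term_counts.items():
        doc_freq = sum(1 for labels in corpus if label in labels)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        weights[label] = (freq / len(address_labels)) * idf
    return weights


# Hypothetical segmented addresses (labels only).
corpus = [
    ["佛山", "南海", "桂澜路"],
    ["佛山", "南海", "海五路"],
    ["佛山", "禅城", "绿景三路"],
]
weights = tfidf_weights(corpus[0], corpus)
```

A label that appears in every address, such as the city, receives the lowest weight, while a label that is rare in the library receives the highest, which matches the idea that the weight reflects how much a label distinguishes an address.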

When a new address text is received, it is first segmented: all administrative-level labels are obtained by conditional random field segmentation. TF-IDF is then used to calculate the weight of the text label at each administrative level. Finally, the Levenshtein distance between each text label of the input address text and the corresponding text labels of candidate address texts in the label library is calculated, the resulting similarities are averaged using the weights, and the address text with the largest value is the one most similar to the input address.
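
The matching step can be sketched as below. The Levenshtein distance routine is the standard dynamic-programming formulation; converting distance to a similarity by dividing by the longer string length, the per-level dictionary layout, the level names and all weights other than the 0.15 quoted for the city are assumptions.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def weighted_similarity(query, candidate, weights):
    """Weighted average of per-level label similarities."""
    total = sum(weights.values())
    return sum(weight * similarity(query.get(level, ""), candidate.get(level, ""))
               for level, weight in weights.items()) / total


# Hypothetical per-level labels and weights for illustration only.
weights = {"city": 0.15, "district": 0.2, "road": 0.25, "building": 0.4}
query = {"city": "佛山", "district": "南海", "road": "桂澜", "building": "碧桂园5座607房"}
candidates = [
    {"city": "佛山", "district": "南海", "road": "桂澜", "building": "碧桂园3座205房"},
    {"city": "佛山", "district": "禅城", "road": "绿景三", "building": "龙光天湖4座201房"},
]
best_match = max(candidates, key=lambda c: weighted_similarity(query, c, weights))
```

Because each level is compared separately, a candidate in a different district and road cannot win on string resemblance alone.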

Example: the address "Room 607, Block 5, Biguiyuan, Guilan Road, Nanhai District, Foshan City" may not be recorded in the extracted label library, so "Biguiyuan Block 5 Room 607" cannot be found in the label library through the first and second conditional random fields (although labels such as "Foshan City", "Nanhai District" and "Guilan Road" may be in the library, obtained from other texts in the same area), whereas the address "Room 205, Block 3, Biguiyuan, Guilan Road, Nanhai District, Foshan City" may have been extracted. After the input address is segmented and weighted, "Biguiyuan Block 5 Room 607" has the highest weight, meaning it has the greatest influence on the address. Levenshtein distance fuzzy matching is then performed state by state, level by level, producing a string similarity for each state, and finally all the similarities are averaged using the TF-IDF weights to obtain the result. For two candidate addresses on Guilan Road, Nanhai District, Foshan City, one ending in "Biguiyuan Block 3 Room 205" and the other in "Yiyun Block 5 Room 607", the string similarity of each to the input is multiplied by the weight of "Biguiyuan Block 5 Room 607", and the products are compared to decide which candidate is more similar.

Performing local similarity matching on labels segmented by level effectively avoids errors caused by writing habits that differ while the meaning is the same, and also avoids over-fuzzy matching. Matching on overall character string similarity alone can rate two strings as extremely similar when they are in fact two different addresses, such as "Room 201, Block 4, Longguang Tianhu, Lake Road, Chancheng District" and "Room 201, Block 4, Longguang Tianhu Huafu, Lüjing Road, Chancheng District". Note: all addresses are fictional and are used for ease of explanation.

The technical principle of the present invention is described above in connection with specific embodiments. The description is made for the purpose of illustrating the principles of the invention and should not be construed in any way as limiting the scope of the invention. Based on the explanations herein, those skilled in the art will be able to conceive of other embodiments of the present invention without inventive effort, which would fall within the scope of the present invention.

Claims

1. An address information extraction and matching method for realizing entity positioning is characterized in that: the method comprises a first address text containing administrative level keywords and a second address text not containing the administrative level keywords, and specifically comprises the following steps:

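constructing a first conditional random field applicable to the first address text, comprising:

determining the states of the first conditional random field according to the administrative level keywords;

performing state jumps on the address text according to the states of the first conditional random field;

dividing the address text into a plurality of sub-texts according to the state jumps;

for the successfully divided address texts, adding text labels to the divided sub-texts at the corresponding administrative levels;

constructing and storing a label library according to the text labels;

constructing a second conditional random field applicable to the second address text, comprising:

adding a new text label to the second address text according to the text labels in the label library;

performing fuzzy matching on the address text, comprising:

acquiring a weight value of each text label in the label library;

and matching, according to the weight values, the address text in the label library whose text labels are most similar to the input address text.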

2. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

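grading the first address text according to administrative level keywords of the first address text, wherein one address text of each level corresponds to one state of the first conditional random field;

the address texts at the same level are arranged side by side, and the address texts at the lower level are arranged behind the address texts at the higher level.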

3. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

in the process of state jump, the high-level state corresponding to the high-level address text jumps to the low-level state corresponding to the low-level address text, and the jump is irreversible;

when the high-level state jumps to the low-level state, all the low-level states of the column in which the low-level state is located are passed;

the states of a single lowest level may jump to each other.

4. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

in each jump, dividing the address text by using the administrative level keywords corresponding to the level state, and performing secondary division on the divided address text in the next low-level state;

selecting the path with the most jumps and determining it as the optimal segmentation path, wherein level states that the address text skips over are not counted in the number of jumps.

5. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

in the successfully divided address texts, the sub-texts and the administrative level vocabularies corresponding to each level state are used as text labels.

6. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

and establishing a dictionary, adding the text labels to the dictionary according to a preset rule, and storing the dictionary into a two-dimensional data table.

7. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

splitting the second address text word by word, combining the split previous word with the split next word, matching the split previous word and the split next word in a tag library after combination, judging whether a text tag of the combination exists or not, and if so, keeping the combination; if not, the combination is not reserved;

after the combination is retained, combining the combination with the next character to form a new combination, matching the new combination in the tag library, and judging whether a text tag of the new combination exists; if so, retaining the new combination and continuing to combine it with the next character; if not, not retaining the new combination;

and so on until all split words can no longer be combined.

8. The method for extracting and matching address information for implementing entity location as claimed in claim 1, wherein:

after the input address text is segmented, weight statistics is carried out on all text labels, each text label corresponds to a weight value, and the weight value is in direct proportion to the importance of the text label.

9. The method for extracting and matching address information for implementing entity location as claimed in claim 8, wherein:

and calculating the similarity between each text label in the input address text and the text label in the label library, and calculating the weighted average of the similarity and the weight value, wherein the text label in the label library with the maximum value is most similar to the input address text.