CN109344254B - Address information classification method and device - Google Patents

Info

Publication number
CN109344254B
Authority
CN
China
Prior art keywords
address information
search algorithm
processed
classified
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811102935.5A
Other languages
Chinese (zh)
Other versions
CN109344254A (en)
Inventor
李胜
单培
李士勇
张瑞飞
李广刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science and Technology (Beijing) Co., Ltd.
Original Assignee
Dingfu Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co Ltd
Priority to CN201811102935.5A
Publication of CN109344254A
Application granted
Publication of CN109344254B
Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an address information classification method and device. The method extracts all address information to be processed from a text; determines the integrity type of each piece of address information to be processed; obtains, according to the integrity type of each piece of address information to be processed and its position in the text, the address information to be classified corresponding to each piece of address information to be processed by using a forward search algorithm and a backward search algorithm, wherein the address information to be classified is complete address information; and classifies the address information to be classified by using its context information to obtain the corresponding category. Therefore, whether or not the extracted address information is complete, complete address information can ultimately be obtained and accurately classified, which improves the accuracy of the classification result.

Description

Address information classification method and device
Technical Field
The present application relates to the field of text processing, and in particular, to an address information classification method and apparatus.
Background
Human-computer interaction data will increasingly involve address information. The internet has become a constantly updated repository of address information, in which a large amount of canonical and non-canonical address information accumulates. Industries that rely on address information have a growing demand for address data to support the analysis, research, and decision-making of various services. Effectively extracting address descriptions from text and classifying them accurately is therefore a necessary and highly practical task.
In the existing approach, address information is first extracted using a BiLSTM-based extraction method, and the extracted address information is then classified. However, BiLSTM requires a large amount of accurately labeled data. Manual labeling increases labor cost and is not portable, while machine labeling can be inaccurate or incomplete, which makes the extraction result inaccurate and ultimately leads to an incorrect classification result.
Disclosure of Invention
The application provides an address information classification method and device to solve the problem that existing address classification methods are prone to producing incorrect classification results.
In a first aspect, the present application provides an address information classification method, including:
extracting all address information to be processed in the text;
determining the integrity type of each piece of address information to be processed according to each piece of address information to be processed, wherein the integrity type of the address information to be processed comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words;
according to the integrity type of each piece of address information to be processed and the position of the address information to be processed in the text, obtaining address information to be classified corresponding to each piece of address information to be processed by utilizing a forward search algorithm and a backward search algorithm;
classifying each address information to be classified by using the context information of each address information to be classified to obtain a category corresponding to each address information to be classified;
and outputting each address information to be classified and the corresponding category.
In a second aspect, the present application provides an address information classifying device, the device including:
the extraction module is used for extracting all address information to be processed in the text;
the determining module is used for determining the integrity type of each piece of address information to be processed according to each piece of address information to be processed, wherein the integrity type of the address information to be processed comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words;
the address to be classified determining module is used for obtaining the address information to be classified corresponding to each address information to be processed by utilizing a forward search algorithm and a backward search algorithm according to the integrity type of each address information to be processed and the position of the address information to be processed in the text;
the classification module is used for classifying the address information to be classified by utilizing the context information of the address information to be classified to obtain the category corresponding to the address information to be classified;
and the output module is used for outputting each address information to be classified and the corresponding category.
According to the technical scheme, the method comprises the steps of firstly extracting address information in a text as address information to be processed, obtaining a complete address to be classified by utilizing a forward search algorithm and a backward search algorithm according to the integrity of the address information to be processed and the position of the address information to be processed in the text, and then classifying the address to be classified by utilizing context information of the address to be classified. Therefore, whether the extracted address information is complete or not, the complete address information can be finally obtained and accurately classified, and the accuracy of the classification result is improved.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a flowchart of an embodiment of an address information classification method provided in the present application;
Fig. 2 is a flowchart of another embodiment of an address information classification method provided in the present application;
Fig. 3 is a schematic structural diagram of an address information classification apparatus provided in the present application;
Fig. 4 is a schematic structural diagram of an embodiment of the to-be-classified address determining module;
Fig. 5 is a schematic structural diagram of another embodiment of the to-be-classified address determining module;
Fig. 6 is a schematic structural diagram of a first search algorithm unit.
Detailed Description
In a first aspect, referring to Fig. 1, an embodiment of the present application provides an address information classification method, which includes the following steps:
Step 101: Extract all address information to be processed in the text.
The address information to be processed can be extracted from the text by using an address information extraction model. Specifically, a Chinese word segmentation system is used to perform word segmentation and part-of-speech tagging on a sufficient number of training texts one by one, and a BiLSTM model is then trained on these texts to generate the address extraction model. A user can then extract the address information in a text by using the model.
Step 102: Determine the integrity type of each piece of address information to be processed, wherein the integrity type comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words.
Step 103: According to the integrity type of each piece of address information to be processed and its position in the text, obtain the address information to be classified corresponding to each piece of address information to be processed by using a forward search algorithm and a backward search algorithm, wherein the address information to be classified is complete address information.
Combining the forward search algorithm and the backward search algorithm allows the boundary of the address information to be processed to be marked accurately, which improves the accuracy and completeness of subsequent data processing.
Step 104: Classify the address information to be classified by using its context information to obtain the corresponding category.
Step 105: Output each piece of address information to be classified and its corresponding category.
According to the technical scheme, the address information in the text is extracted as the address information to be processed, the forward search algorithm and the backward search algorithm are utilized according to the integrity of the address information to be processed and the position of the address information to be processed in the text to obtain the complete address to be classified, and then the context information of the address to be classified is utilized to classify the address to be classified. Therefore, whether the extracted address information is complete or not, the complete address information can be finally obtained and accurately classified, and the accuracy of the classification result is improved.
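The data flow of steps 101 to 105 can be sketched as follows. This is only an illustrative outline: the function name `classify_addresses`, its four callable parameters, and the category label "residence" are hypothetical and do not come from the application, which does not prescribe any particular interface.

```python
def classify_addresses(text, extract, integrity_type, complete, classify):
    """Run steps 101-105 on one text and return (address, category) pairs.

    All four callables are placeholders supplied by the caller (assumptions).
    """
    results = []
    for candidate in extract(text):                 # step 101: extract candidates
        kind = integrity_type(candidate)            # step 102: positive or negative
        address = complete(text, candidate, kind)   # step 103: forward/backward search
        category = classify(text, address)          # step 104: classify by context
        results.append((address, category))
    return results                                  # step 105: output


# Toy stand-ins, only to show how the pieces connect.
demo = classify_addresses(
    "lives at G Town H Community",
    extract=lambda text: ["H Community"],
    integrity_type=lambda candidate: "positive",
    complete=lambda text, candidate, kind: "G Town H Community",
    classify=lambda text, address: "residence",
)
```

Passing the stages in as callables keeps the outline independent of how each stage is actually implemented.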
Referring to Fig. 2, an address information classification method provided in another embodiment of the present application includes the following steps:
Step 201: Extract all address information to be processed in the text.
The address information to be processed can be extracted from the text by using an address information extraction model. Specifically, a Chinese word segmentation system is used to perform word segmentation and part-of-speech tagging on a sufficient number of training texts one by one, and a BiLSTM model is then trained on these texts to generate the address extraction model. A user can then extract the address information in a text by using the model.
Step 202: Determine the integrity type of each piece of address information to be processed, wherein the integrity type comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words.
The address information extracted by the address information extraction model may be complete address information in the text, partial address information, or address information that includes other words. For example, suppose the text is (rendered word by word, with address components in the order in which they appear in the text, from largest to smallest): "So-and-so (household registration: AA Province BB City CC District dd Street e Community Building x Unit x No. x, ID number: xxxxxxxxxxx) reports states in AA Province BB City CC District G Town H Community Building x Unit x No. x was stolen, the door lock is intact, and the safe in the home was pried open". If the results extracted by the address information extraction model are "AA Province BB City CC District dd Street e Community Building x Unit x No. x" and "CC District G Town H Community", then the former is complete address information and is therefore positive address information, while "CC District G Town H Community" is partial address information and is also positive address information. If the extraction result is "Unit x No. x was stolen", it contains the words "was stolen" in addition to "Unit x No. x" and is therefore negative address information.
Step 203: If the address information to be processed is positive address information, search from the position of the address information to be processed in the text in a first direction corresponding to the first search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information, wherein when the first search algorithm is the forward search algorithm, the first direction is forward; when the first search algorithm is the backward search algorithm, the first direction is backward.
Step 204: If the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 203, until a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 205: If the merged address information is negative address information, record the number of consecutive times it has been judged to be negative address information, take the merged address information as the address information to be processed, and return to step 203, until the number of consecutive negative judgments equals a preset number or a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 206: Determine the address information to be processed that was most recently judged to be positive address information as the first target address information.
The preset stop symbol may be set by the user, for example a comma or a semicolon. Continuing with the example text above, assume that the results extracted using the address information extraction model are "AA Province BB City CC District dd Street e Community Building x Unit x No. x" and "CC District G Town H Community". When the first search algorithm is the forward search algorithm, the search proceeds forward from the position of the address information to be processed "AA Province BB City CC District dd Street e Community Building x Unit x No. x" in the text. The adjacent position is found to be a comma, so no further search loop is performed, and the first target address information is "AA Province BB City CC District dd Street e Community Building x Unit x No. x".
When the address information to be processed is "CC District G Town H Community", it is judged to be positive address information. When the first search algorithm is the forward search algorithm, the search proceeds forward from the position of the address information to be processed in the text. The adjacent word is "BB City"; merging "BB City" with "CC District G Town H Community" gives the merged address information "BB City CC District G Town H Community", which is judged to still be positive address information, so the forward search continues. The next adjacent word is "AA Province"; merging gives "AA Province BB City CC District G Town H Community", which is still positive address information, so the forward search continues. The next adjacent word is "in"; merging gives "in AA Province BB City CC District G Town H Community", which is judged to be negative address information, and the number of consecutive negative judgments is recorded as 1. The forward search continues, and the next adjacent word is "states"; merging gives "states in AA Province BB City CC District G Town H Community", which is still negative address information, and the number of consecutive negative judgments is recorded as 2. The forward search continues, and the next adjacent word is "reports"; merging gives "reports states in AA Province BB City CC District G Town H Community", which is still negative address information, and the number of consecutive negative judgments is recorded as 3. If the preset number of consecutive negative judgments is 3, the forward search stops, and "AA Province BB City CC District G Town H Community", the address information most recently judged to be positive address information, is determined as the first target address information.
When the first search algorithm is the backward search algorithm, only the search direction differs from the preceding example; everything else is the same and is not repeated here.
Step 207: Search from the position of the first target address information in the text in a second direction corresponding to the second search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information, wherein when the second search algorithm is the forward search algorithm, the second direction is forward; when the second search algorithm is the backward search algorithm, the second direction is backward.
Step 208: If the merged address information is positive address information, take the merged address information as the first target address information and return to step 206, until a preset stop symbol adjacent to the address information to be processed is found in the second direction.
Step 209: If the merged address information is negative address information, record the number of consecutive times it has been judged to be negative address information, take the merged address information as the first target address information, and return to step 206, until the number of consecutive negative judgments equals the preset number or a preset stop symbol adjacent to the address information to be processed is found in the second direction.
Step 210: Determine the first target address information that was most recently judged to be positive address information as the address information to be classified.
Continuing with the above example, the first target address information is "AA Province BB City CC District dd Street e Community Building x Unit x No. x" and "AA Province BB City CC District G Town H Community". The address information to be processed "AA Province BB City CC District dd Street e Community Building x Unit x No. x" is judged to be positive address information. The second search algorithm is the backward search algorithm, so the search proceeds backward from the position of the address information to be processed in the text. The adjacent position is found to be a comma, so no further search loop is performed, and the address information to be classified is "AA Province BB City CC District dd Street e Community Building x Unit x No. x".
The address information to be processed "AA Province BB City CC District G Town H Community" is judged to be positive address information. The second search algorithm is the backward search algorithm, so the search proceeds backward from the position of the address information to be processed in the text. The adjacent word is "Building x"; merging gives "AA Province BB City CC District G Town H Community Building x", which is judged to still be positive address information, so the backward search continues. The next adjacent word is "Unit x"; merging gives "AA Province BB City CC District G Town H Community Building x Unit x", which is still positive address information, so the backward search continues. The next adjacent word is "No. x"; merging gives "AA Province BB City CC District G Town H Community Building x Unit x No. x", which is still positive address information, so the backward search continues. The next adjacent word is "was stolen"; merging gives "AA Province BB City CC District G Town H Community Building x Unit x No. x was stolen", which is judged to be negative address information, and the number of consecutive negative judgments is recorded as 1. The backward search continues, and the adjacent position is a comma, so the search stops, and "AA Province BB City CC District G Town H Community Building x Unit x No. x" is determined as the address information to be classified.
When the second search algorithm is the forward search algorithm, only the search direction differs from the preceding example; everything else is the same and is not described again.
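The two-pass search of steps 203 to 210 can be sketched as follows, using a simplified token list based on the example text. This is a minimal sketch under stated assumptions: the application does not say how a span is judged positive or negative, so `is_positive` is a toy judge based on hypothetical word prefixes and suffixes, and `expand`, `complete_address`, and the token list are illustrative names. "Forward" means toward the start of the text, as in the example above.

```python
STOP_SYMBOLS = {",", ";"}  # preset stop symbols (comma and semicolon, per the examples)


def is_positive(words):
    """Toy integrity judge (an assumption): positive if every word is an address component."""
    suffixes = ("Province", "City", "District", "Street", "Town", "Community")
    prefixes = ("Building", "Unit", "No.")
    return all(w.endswith(suffixes) or w.startswith(prefixes) for w in words)


def expand(tokens, start, end, direction, max_negative_run=3):
    """Grow tokens[start:end] one adjacent word at a time.

    direction=-1 searches forward (toward the start of the text),
    direction=+1 searches backward (toward the end of the text).
    Returns (start, end) of the span most recently judged positive.
    """
    best = (start, end)
    negative_run = 0
    while negative_run < max_negative_run:
        nxt = start - 1 if direction < 0 else end
        if nxt < 0 or nxt >= len(tokens) or tokens[nxt] in STOP_SYMBOLS:
            break  # stop symbol or edge of the text reached
        if direction < 0:
            start = nxt
        else:
            end = nxt + 1
        if is_positive(tokens[start:end]):
            best, negative_run = (start, end), 0
        else:
            negative_run += 1  # count consecutive negative judgments
    return best


def complete_address(tokens, start, end):
    """First pass forward, then second pass backward from the first target span."""
    start, end = expand(tokens, start, end, direction=-1)
    start, end = expand(tokens, start, end, direction=+1)
    return tokens[start:end]


tokens = ["reports", "states", "in",
          "AA Province", "BB City", "CC District", "G Town", "H Community",
          "Building x", "Unit x", "No. x", "was stolen", ",",
          "the door lock is intact"]
# The extracted candidate "CC District G Town H Community" is tokens[5:8].
first_target = expand(tokens, 5, 8, direction=-1)
address = complete_address(tokens, 5, 8)
```

Returning the most recent positive span rather than the current span is what discards the trailing non-address words that were merged during the negative judgments.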
Step 211: If the address information to be processed is negative address information, perform word segmentation on the address information to be processed to obtain a plurality of words.
Assuming that the address information to be processed extracted in the above example includes "Unit x No. x was stolen", which is negative address information, word segmentation is performed on it to obtain "Unit x", "No. x", and "was stolen".
Step 212: Extract any address word from the segmented words and determine it as the address information to be processed.
Since the address words in the segmentation result are "Unit x" and "No. x", either of them may be extracted as the address information to be processed.
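Steps 211 and 212 can be illustrated with a short sketch. It assumes that the word segmentation step also yields a part-of-speech tag for each word; the tag names "ns" (place name) and "v" (verb), the set `ADDRESS_TAGS`, and the function `address_words` are hypothetical and only stand in for whatever the segmentation system actually produces.

```python
ADDRESS_TAGS = {"ns"}  # assumed tag for address-like words (hypothetical)


def address_words(tagged_words):
    """Return the words whose tag marks them as address components."""
    return [word for word, tag in tagged_words if tag in ADDRESS_TAGS]


# Segmentation result for the negative address information "Unit x No. x was stolen".
tagged = [("Unit x", "ns"), ("No. x", "ns"), ("was stolen", "v")]
candidates = address_words(tagged)
seed = candidates[0]  # any address word may be chosen; the first is used here
```

The chosen word then serves as the starting span for the same two-direction search that is applied to positive address information.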
Step 213: Search from the position of the address information to be processed in the text in a first direction corresponding to the first search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information, wherein when the first search algorithm is the forward search algorithm, the first direction is forward; when the first search algorithm is the backward search algorithm, the first direction is backward.
Step 214: If the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 212, until a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 215: If the merged address information is negative address information, record the number of consecutive times it has been judged to be negative address information, take the merged address information as the address information to be processed, and return to step 212, until the number of consecutive negative judgments equals the preset number or a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 216: Determine the address information to be processed that was most recently judged to be positive address information as the first target address information.
Step 217: Search from the position of the first target address information in the text in a second direction corresponding to the second search algorithm, and merge one adjacent word with the address information to be processed to obtain merged address information, wherein when the second search algorithm is the forward search algorithm, the second direction is forward; when the second search algorithm is the backward search algorithm, the second direction is backward.
Step 218: If the merged address information is positive address information, take the merged address information as the first target address information and return to step 216, until a preset stop symbol adjacent to the address information to be processed is found in the second direction.
Step 219: If the merged address information is negative address information, record the number of consecutive times it has been judged to be negative address information, take the merged address information as the first target address information, and return to step 214, until the number of consecutive negative judgments equals the preset number or a preset stop symbol adjacent to the address information to be processed is found in the second direction.
Step 220: Determine the first target address information that was most recently judged to be positive address information as the address information to be classified.
The processing in steps 213 to 220 is the same as that in steps 203 to 210 and is not described again here. Thus, by combining the forward search algorithm and the backward search algorithm, the boundary of the complete address information can be delimited so that no other words are included, which improves the accuracy and completeness of subsequent processing results.
Step 221: Acquire the context information of each piece of address information to be classified to obtain the target text information to which each piece of address information to be classified belongs.
The context information of the address information to be classified consists of a preset number of words before and after the position of the address to be classified in the text. If these words include preset punctuation marks such as commas, periods, or semicolons, only the words between the punctuation marks are taken, which gives the target text information containing the address information to be classified. For example, the address information to be classified is "AA Province BB City CC District dd Street e Community Building x Unit x No. x", and the preset number of words in each direction is 3. Because the address information to be classified is followed by a comma and preceded by only two words, the target text information to which it belongs is "So-and-so (household registration: AA Province BB City CC District dd Street e Community Building x Unit x No. x".
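The context window of step 221 can be sketched as follows. The function name `context_window`, the token list, and the punctuation set are illustrative assumptions; the sketch takes at most `n` words on each side of the address span and does not cross a preset punctuation mark.

```python
PUNCTUATION = {",", ".", ";"}  # preset punctuation marks (assumed set)


def context_window(tokens, start, end, n=3):
    """Return the address span tokens[start:end] plus up to n words on each side,
    stopping early at preset punctuation or at the edge of the text."""
    left = start
    while left > 0 and start - left < n and tokens[left - 1] not in PUNCTUATION:
        left -= 1
    right = end
    while right < len(tokens) and right - end < n and tokens[right] not in PUNCTUATION:
        right += 1
    return tokens[left:right]


address = ["AA Province", "BB City", "CC District", "dd Street",
           "e Community", "Building x", "Unit x", "No. x"]
tokens = ["So-and-so", "(household registration:", *address,
          ",", "ID number:", "xxxxxxxxxxx"]
window = context_window(tokens, 2, 10, n=3)
target_text = " ".join(window)
```

Here only two words precede the address and a comma follows it, so the window consists of those two words and the address itself, matching the example in the text.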
Step 222: Replace the address information to be classified in each piece of target text information with preset characters.
The preset characters are not limited in the embodiments of the present application and may be, for example, letters or numbers. For example, when the address information to be classified in "So-and-so (household registration: AA Province BB City CC District dd Street e Community Building x Unit x No. x" is replaced with the character string "aaaaaa", the result is "So-and-so (household registration: aaaaaa". Replacing the address information to be classified with preset characters prevents the address itself from interfering with the subsequent semantic analysis and improves classification accuracy.
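The replacement in step 222 can be sketched as a plain string substitution that also keeps a mapping for restoring the address after classification. The function name `mask_address` and the returned mapping are illustrative assumptions, not part of the application.

```python
def mask_address(target_text, address, placeholder="aaaaaa"):
    """Replace the first occurrence of the address with preset characters.

    Returns the masked text and a mapping used to restore the address later.
    """
    if address not in target_text:
        raise ValueError("address not found in target text")
    masked = target_text.replace(address, placeholder, 1)
    return masked, {placeholder: address}


address = "AA Province BB City CC District dd Street e Community Building x Unit x No. x"
text = "So-and-so (household registration: " + address
masked, mapping = mask_address(text, address)
```

In practice the placeholder should be a string that cannot occur elsewhere in the text, otherwise restoring the address from the mapping becomes ambiguous.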
Step 223: Classify the address information to be classified in each piece of replaced target text information by using a semantic classification model according to the semantics of the replaced target text information, to obtain the category of each piece of address information to be classified.
The semantic classification model is obtained by training TextCNN on training samples. TextCNN achieves high accuracy when applied to Chinese text processing. TextCNN is commonly used for single-label classification: a convolutional layer, a pooling layer, and a fully connected layer are followed by a Softmax layer. The Softmax layer outputs a probability distribution over the classes, and the class with the highest probability is the final output of the classification model. The single-label classification model can reach up to 97% accuracy in business scenarios.
The semantic classification model performs semantic analysis on the replaced target text information and then classifies it to obtain the category of the address information to be classified. For example, semantic analysis and classification of the replaced target text information "So-and-so (household registration: aaaaaa" identify "aaaaaa" as a household registration address. "aaaaaa" is then converted back into the corresponding address information to be classified, and the final result is: household registration address, "AA Province BB City CC District dd Street e Community Building x Unit x No. x".
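The final Softmax step and the restoration of the address can be sketched as follows. This covers only the last stage: the scores stand in for the output of the fully connected layer of a trained model, and the score values, the category names other than the household registration address, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_category(scores, categories):
    """Return the category with the highest Softmax probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return categories[best], probs[best]


categories = ["household registration address", "incident address", "other"]
scores = [2.0, 0.5, -1.0]  # made-up scores for "So-and-so (household registration: aaaaaa"
category, probability = pick_category(scores, categories)

# Convert the placeholder back into the address it replaced.
mapping = {"aaaaaa": "AA Province BB City CC District dd Street e Community Building x Unit x No. x"}
result = (mapping["aaaaaa"], category)
```

Because the model sees only the placeholder, the predicted category depends on the surrounding words rather than on the address string itself.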
Step 224: output each piece of address information to be classified and its corresponding category.
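Steps 223 and 224 can be sketched together as below. The trained semantic classification model is replaced here by a keyword rule, `toy_classifier`, purely so that the example runs; that function, the name `label_addresses`, and the mapping from each original address to its masked text are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def label_addresses(
    masked_texts: Dict[str, str],
    classify: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Classify each masked text and pair the category with the original address.

    masked_texts maps each address to be classified to the target text in
    which that address has already been replaced by preset characters.
    """
    return [(address, classify(text)) for address, text in masked_texts.items()]


def toy_classifier(masked_text: str) -> str:
    # Stand-in for the trained model: a keyword rule, for demonstration only.
    if "household register" in masked_text:
        return "household register address"
    return "other"


results = label_addresses(
    {"aa province bb city cc area dd street": "someone (household register: aaaaaa"},
    toy_classifier,
)
for address, category in results:
    print(f"{category}: {address}")
```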
According to the above technical solution, the address information in the text is extracted as address information to be processed; a forward search algorithm and a backward search algorithm are applied, according to the integrity of the address information to be processed and its position in the text, to obtain the complete address information to be classified; and the context information of that address information is then used to classify it. Therefore, whether or not the extracted address information is complete, complete address information is ultimately obtained and accurately classified, which improves the accuracy of the classification result.
In a second aspect, referring to fig. 3, the present application provides an address information classifying apparatus, including:
the extraction module 301 is configured to extract all address information to be processed in the text;
a determining module 302, configured to determine, according to each piece of to-be-processed address information, an integrity type of each piece of to-be-processed address information, where the integrity type of each piece of to-be-processed address information includes positive address information and negative address information, the positive address information includes complete or partial address information, and the negative address information includes address information including other words;
a to-be-classified address determining module 303, configured to obtain, according to the integrity type of each piece of to-be-processed address information and the position of the to-be-processed address information in the text, to-be-classified address information corresponding to each piece of to-be-processed address information by using a forward search algorithm and a backward search algorithm, where the to-be-classified address information is complete address information;
the classification module 304 is configured to classify each piece of address information to be classified by using context information of each piece of address information to be classified, so as to obtain a category corresponding to each piece of address information to be classified;
an output module 305, configured to output each address information to be classified and a corresponding category.
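The data flow between the five modules can be sketched as a small class. This is only an assumed structure: the class name, the callable signatures, and the stub lambdas in the usage part are hypothetical and stand in for the real extraction, integrity, search, and classification logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AddressClassifierApparatus:
    # Each field stands for one module; the signatures are assumptions.
    extract: Callable[[str], List[str]]        # extraction module 301
    integrity_type: Callable[[str], str]       # determining module 302
    complete: Callable[[str, str, str], str]   # to-be-classified address determining module 303
    classify: Callable[[str, str], str]        # classification module 304

    def run(self, text: str) -> List[Tuple[str, str]]:
        """Return (address, category) pairs, the role of output module 305."""
        results = []
        for pending in self.extract(text):
            kind = self.integrity_type(pending)           # "positive" or "negative"
            address = self.complete(text, pending, kind)  # forward and backward search
            results.append((address, self.classify(text, address)))
        return results


# Stub modules, for demonstration only.
apparatus = AddressClassifierApparatus(
    extract=lambda text: ["bb city"],
    integrity_type=lambda pending: "positive",
    complete=lambda text, pending, kind: "aa province bb city",
    classify=lambda text, address: "household register address",
)
results = apparatus.run("household register: aa province bb city")
print(results)
```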
Further, referring to fig. 4, the address to be classified determining module 303 includes:
a first search algorithm unit 401, configured to, if the address information to be processed is positive address information, obtain first target address information by using a first search algorithm from a position of the address information to be processed in the text, where the first search algorithm is a forward search algorithm or a backward search algorithm;
a second search algorithm unit 402, configured to obtain address information to be classified by using a second search algorithm from a position of the first target address information in the text, where the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
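The relationship between the two units can be sketched as below: whichever algorithm runs first, the second is its opposite. The span representation, the function names, and the stub search are assumptions. In particular, which textual direction "forward" denotes is not restated in this section, so the stub's choice is arbitrary.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # inclusive (first, last) word indices; an assumed representation
FORWARD, BACKWARD = "forward", "backward"


def opposite(algorithm: str) -> str:
    """Return the second search algorithm for a given first search algorithm."""
    if algorithm == FORWARD:
        return BACKWARD
    if algorithm == BACKWARD:
        return FORWARD
    raise ValueError(f"unknown search algorithm: {algorithm!r}")


def complete_address(pending: Span, first: str,
                     search: Callable[[Span, str], Span]) -> Span:
    """Run the first search, then the opposite search from its result."""
    first_target = search(pending, first)          # unit 401
    return search(first_target, opposite(first))   # unit 402


calls: List[str] = []


def stub_search(span: Span, algorithm: str) -> Span:
    # Demo only: assume "forward" grows the span by one word to the left.
    calls.append(algorithm)
    lo, hi = span
    return (lo - 1, hi) if algorithm == FORWARD else (lo, hi + 1)


result = complete_address((3, 3), BACKWARD, stub_search)
print(result, calls)
```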
Further, referring to fig. 5, the address to be classified determining module 303 further includes:
a word segmentation unit 501, configured to perform word segmentation on the address information to be processed to obtain multiple word segments if the address information to be processed is negative address information;
an extracting unit 502, configured to extract any address participle from the multiple participles, and determine the address participle as address information to be processed;
the first search algorithm unit 401 is further configured to obtain first target address information by using a first search algorithm from the position of the address information to be processed in the text, where the first search algorithm is a forward search algorithm or a backward search algorithm;
the second search algorithm unit 402 is further configured to obtain address information to be classified by using a second search algorithm from the position of the first target address information in the text, where the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
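The handling of negative address information by units 501 and 502 can be sketched as follows. The segmenter and the address-word test are passed in as parameters because the section does not specify them; the whitespace split and the suffix check in the usage part are hypothetical stand-ins. The text allows any address participle to be extracted, and this sketch simply takes the first one.

```python
from typing import Callable, List, Optional


def pick_address_word(
    pending: str,
    segment: Callable[[str], List[str]],
    is_address_word: Callable[[str], bool],
) -> Optional[str]:
    """Segment negative address information and return one address participle.

    Returns None when no participle is recognised as an address word.
    """
    for word in segment(pending):          # word segmentation unit 501
        if is_address_word(word):          # extracting unit 502
            return word
    return None


# Hypothetical stand-ins: whitespace segmentation and a suffix-based check.
new_pending = pick_address_word(
    "lives in bb-city now",
    segment=str.split,
    is_address_word=lambda w: w.endswith(("-province", "-city", "-street")),
)
print(new_pending)
```

The returned participle then becomes the address information to be processed, and the first and second searches proceed from its position in the text.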
Further, referring to fig. 6, the first search algorithm unit 401 includes:
a first direction search subunit 601, configured to start a search from the position of the address information to be processed in the text to a first direction corresponding to the first search algorithm, and merge an adjacent word with the address information to be processed to obtain merged address information, where when the first search algorithm is a forward search algorithm, the first direction is a forward direction; when the first search algorithm is a backward search algorithm, the first direction is a backward direction;
a loop judgment subunit 602, configured to: if the merged address information is positive address information, determine the merged address information as the address information to be processed and repeat the step of searching in the first direction until a preset stop symbol adjacent to the address information to be processed is found in the first direction; if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, determine the merged address information as the address information to be processed, and repeat the step of searching in the first direction until that number equals the preset number of consecutive times or until a preset stop symbol adjacent to the address information to be processed is found in the first direction;
a determining subunit 603, configured to determine, as the first target address information, the address information to be processed that was most recently determined to be positive address information.
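The loop carried out by subunits 601 to 603 can be sketched as below, under several assumptions: the text is already split into words, the address is tracked as an inclusive index span, the direction is given as a step of -1 or +1 (which of these corresponds to "forward" is left to the caller), the integrity check is a caller-supplied predicate, and the consecutive-negative counter resets whenever a merge is judged positive. Stopping at the edge of the text is also an addition of this sketch.

```python
from typing import Callable, Sequence, Set, Tuple


def directional_search(
    tokens: Sequence[str],
    start: int,
    end: int,
    step: int,
    is_positive: Callable[[Sequence[str]], bool],
    stop_symbols: Set[str],
    max_consecutive_negative: int = 2,
) -> Tuple[int, int]:
    """Grow tokens[start..end] one word at a time in one direction.

    Returns the last span that was judged to be positive address information.
    """
    if step not in (-1, 1):
        raise ValueError("step must be -1 or +1")
    lo, hi = start, end
    best = (start, end)
    negatives = 0
    while True:
        nxt = lo - 1 if step == -1 else hi + 1
        # Stop at the text boundary or at an adjacent preset stop symbol.
        if nxt < 0 or nxt >= len(tokens) or tokens[nxt] in stop_symbols:
            break
        if step == -1:
            lo = nxt
        else:
            hi = nxt
        if is_positive(tokens[lo:hi + 1]):   # merged address information
            negatives = 0
            best = (lo, hi)
        else:
            negatives += 1
            if negatives == max_consecutive_negative:
                break
    return best


# Hypothetical vocabulary-based integrity check, for demonstration only.
vocab = {"aa-province", "bb-city", "cc-area", "dd-street"}
tokens = ["register", ":", "aa-province", "bb-city", "cc-area", "dd-street", ",", "phone"]
check = lambda words: all(w in vocab for w in words)

first_target = directional_search(tokens, 4, 4, +1, check, {":", ","})
complete = directional_search(tokens, *first_target, -1, check, {":", ","})
print(tokens[complete[0]:complete[1] + 1])
```

Running the search once in each direction, as in the last two calls, yields the complete address span.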
With this apparatus, the address information in the text is extracted as address information to be processed; a forward search algorithm and a backward search algorithm are applied, according to the integrity of the address information to be processed and its position in the text, to obtain the complete address information to be classified; and the context information of that address information is then used to classify it. Therefore, whether or not the extracted address information is complete, complete address information is ultimately obtained and accurately classified, which improves the accuracy of the classification result.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application may be embodied as a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the method described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments in this specification are described progressively; the same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the apparatus embodiment is substantially similar to the method embodiment, it is described briefly; for the relevant parts, refer to the description of the method embodiment.

Claims (9)

1. A method for classifying address information, the method comprising:
extracting all address information to be processed in the text;
determining the integrity type of each piece of address information to be processed according to each piece of address information to be processed, wherein the integrity type of the address information to be processed comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words;
according to the integrity type of each piece of address information to be processed and the position of the address information to be processed in the text, obtaining address information to be classified corresponding to each piece of address information to be processed by utilizing a forward search algorithm and a backward search algorithm, wherein the address information to be classified is complete address information;
classifying each address information to be classified by using the context information of each address information to be classified to obtain a category corresponding to each address information to be classified;
outputting each address information to be classified and the corresponding category;
the classifying each address information to be classified by using the context information of each address information to be classified to obtain the category corresponding to each address information to be classified, includes:
acquiring context information of each address information to be classified to obtain target text information to which each address information to be classified belongs;
replacing the address information to be classified in each target text message with preset characters;
and classifying the address information to be classified in each replaced target text information by utilizing a semantic classification model according to the semantic meaning of each replaced target text information to obtain the category of each address information to be classified.
2. The method of claim 1, wherein the obtaining the address information to be classified corresponding to each piece of the address information to be processed by using a forward search algorithm and a backward search algorithm according to the integrity type of each piece of the address information to be processed and the position of the address information to be processed in the text comprises:
if the address information to be processed is positive address information, obtaining first target address information by utilizing a first search algorithm from the position of the address information to be processed in the text, wherein the first search algorithm is a forward search algorithm or a backward search algorithm;
obtaining address information to be classified by using a second search algorithm from the position of the first target address information in the text, wherein the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
3. The method of claim 1, wherein the obtaining the address information to be classified corresponding to each piece of the address information to be processed by using a forward search algorithm and a backward search algorithm according to the integrity type of each piece of the address information to be processed and the position of the address information to be processed in the text comprises:
if the address information to be processed is negative address information, performing word segmentation processing on the address information to be processed to obtain a plurality of words;
extracting any address participle in the participles, and determining the address participle as address information to be processed;
starting from the position of the address information to be processed in the text, obtaining first target address information by utilizing a first search algorithm, wherein the first search algorithm is a forward search algorithm or a backward search algorithm;
obtaining address information to be classified by using a second search algorithm from the position of the first target address information in the text, wherein the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
4. The method of claim 2 or 3, wherein the obtaining first target address information by using a first search algorithm starting from the position of the address information to be processed in the text comprises:
searching in a first direction corresponding to the first search algorithm from the position of the address information to be processed in the text, and combining an adjacent word with the address information to be processed to obtain combined address information, wherein when the first search algorithm is a forward search algorithm, the first direction is a forward direction; when the first search algorithm is a backward search algorithm, the first direction is a backward direction;
if the merged address information is positive address information, determining the merged address information as address information to be processed, and repeating the step of searching in the first direction until a preset stop symbol adjacent to the address information to be processed is searched in the first direction; if the merged address information is negative address information, recording the continuous times of judging as the negative address information, determining the merged address information as address information to be processed, and repeating the step of searching in the first direction until the continuous times of judging as the negative address information is equal to the preset continuous times, or searching in the first direction until a preset stop symbol adjacent to the address information to be processed is reached;
and determining the address information to be processed which is determined as the positive address information at the last time as first target address information.
5. The method of claim 2 or 3, wherein the obtaining address information to be classified by using a second search algorithm starting from the position of the first target address information in the text comprises:
searching from the position of the first target address information in the text to a second direction corresponding to the second search algorithm, and combining an adjacent word with the address information to be processed to obtain combined address information, wherein when the second search algorithm is a forward search algorithm, the second direction is a forward direction; when the second search algorithm is a backward search algorithm, the second direction is a backward direction;
if the merged address information is positive address information, determining the merged address information as first target address information, and repeating the step of searching in the second direction until a preset stop symbol adjacent to the address information to be processed is searched in the second direction; if the merged address information is negative address information, recording the continuous times of judging as the negative address information, determining the merged address information as first target address information, and repeating the step of searching in the second direction until the continuous times of judging as the negative address information is equal to the preset continuous times, or searching in the second direction until a preset stop symbol adjacent to the address information to be processed is reached;
and determining the first target address information which is determined as the positive address information at the last time as the address information to be classified.
6. An address information classifying apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting all address information to be processed in the text;
the determining module is used for determining the integrity type of each piece of address information to be processed according to each piece of address information to be processed, wherein the integrity type of the address information to be processed comprises positive address information and negative address information, the positive address information comprises complete or partial address information, and the negative address information comprises address information containing other words;
the address to be classified determining module is used for obtaining address information to be classified corresponding to each address information to be processed by utilizing a forward search algorithm and a backward search algorithm according to the integrity type of each address information to be processed and the position of the address information to be processed in the text, wherein the address information to be classified is complete address information;
the classification module is configured to classify each piece of address information to be classified by using context information of each piece of address information to be classified, to obtain a category corresponding to each piece of address information to be classified, and includes:
acquiring context information of each address information to be classified to obtain target text information to which each address information to be classified belongs;
replacing the address information to be classified in each target text message with preset characters;
classifying the address information to be classified in each replaced target text information by utilizing a semantic classification model according to the semantic meaning of each replaced target text information to obtain the category of each address information to be classified;
and the output module is used for outputting each address information to be classified and the corresponding category.
7. The apparatus of claim 6, wherein the address to be classified determining module comprises:
a first search algorithm unit, configured to, if the address information to be processed is positive address information, obtain first target address information by using a first search algorithm from a position of the address information to be processed in the text, where the first search algorithm is a forward search algorithm or a backward search algorithm;
the second search algorithm unit is used for obtaining the address information to be classified by using a second search algorithm from the position of the first target address information in the text, wherein when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
8. The apparatus of claim 6, wherein the address to be classified determining module further comprises:
the word segmentation unit is used for performing word segmentation on the address information to be processed to obtain a plurality of words if the address information to be processed is negative address information;
the extraction unit is used for extracting any address participle in the participles and determining the address participle as address information to be processed;
the first search algorithm unit is further used for obtaining first target address information by using a first search algorithm from the position of the address information to be processed in the text, wherein the first search algorithm is a forward search algorithm or a backward search algorithm;
the second search algorithm unit is further used for obtaining the address information to be classified by using a second search algorithm from the position of the first target address information in the text, wherein the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
9. The apparatus of claim 7 or 8, wherein the first search algorithm unit comprises:
the first direction searching subunit is configured to start searching in a first direction corresponding to the first search algorithm from the position of the address information to be processed in the text, merge an adjacent word with the address information to be processed, and obtain merged address information, where when the first search algorithm is a forward search algorithm, the first direction is a forward direction; when the first search algorithm is a backward search algorithm, the first direction is a backward direction;
a loop judgment subunit, configured to determine the merged address information as address information to be processed if the merged address information is positive address information, and repeat the step of searching in the first direction until a preset stop symbol adjacent to the address information to be processed is searched in the first direction; if the merged address information is negative address information, recording the continuous times of judging as the negative address information, determining the merged address information as address information to be processed, and repeating the step of searching in the first direction until the continuous times of judging as the negative address information is equal to the preset continuous times, or searching in the first direction until a preset stop symbol adjacent to the address information to be processed is reached;
and the determining subunit is used for determining the address information to be processed which is determined as the positive address information at the last time as the first target address information.
CN201811102935.5A 2018-09-20 2018-09-20 Address information classification method and device Active CN109344254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811102935.5A CN109344254B (en) 2018-09-20 2018-09-20 Address information classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811102935.5A CN109344254B (en) 2018-09-20 2018-09-20 Address information classification method and device

Publications (2)

Publication Number Publication Date
CN109344254A CN109344254A (en) 2019-02-15
CN109344254B true CN109344254B (en) 2020-12-18

Family

ID=65306508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811102935.5A Active CN109344254B (en) 2018-09-20 2018-09-20 Address information classification method and device

Country Status (1)

Country Link
CN (1) CN109344254B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110738305A (en) * 2019-08-27 2020-01-31 深圳市跨越新科技有限公司 method and system for analyzing logistics waybill address

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428576B2 (en) * 2000-05-16 2008-09-23 Hoshiko Llc Addressee-defined mail addressing system and method
CN101980208A (en) * 2010-11-10 2011-02-23 百度在线网络技术(北京)有限公司 Address query method and system
CN103440312B (en) * 2013-08-27 2019-01-22 深圳市华傲数据技术有限公司 A kind of system and terminal of mailing address inquiry postcode
US9582486B2 (en) * 2014-05-13 2017-02-28 Lc Cns Co., Ltd. Apparatus and method for classifying and analyzing documents including text
CN107305540B (en) * 2016-04-20 2021-03-02 顺丰科技有限公司 Address segmentation recognition method
CN108509441A (en) * 2017-02-24 2018-09-07 菜鸟智能物流控股有限公司 Training of address validity classifier, verification method thereof and related device
CN107368470A (en) * 2017-06-27 2017-11-21 北京神州泰岳软件股份有限公司 A kind of method and apparatus for extracting enterprises organizational structure information

Also Published As

Publication number Publication date
CN109344254A (en) 2019-02-15

Similar Documents

Publication Publication Date Title
CN107657048B (en) User identification method and device
CN101305370B (en) Information classification paradigm
US20180075013A1 (en) Method and system for automating training of named entity recognition in natural language processing
US10831993B2 (en) Method and apparatus for constructing binary feature dictionary
CN108121715B (en) Character labeling method and character labeling device
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN111309910A (en) Text information mining method and device
CN110910175B (en) Image generation method for travel ticket product
US20210103699A1 (en) Data extraction method and data extraction device
CN104850617A (en) Short text processing method and apparatus
CN107451120B (en) Content conflict detection method and system for open text information
CN112784009A (en) Subject term mining method and device, electronic equipment and storage medium
CN101470699B (en) Information extraction model training apparatus, information extraction apparatus and information extraction system and method thereof
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN109344254B (en) Address information classification method and device
Ohta et al. CRF-based bibliography extraction from reference strings focusing on various token granularities
CN102103502A (en) Method and system for analyzing a legacy system based on trails through the legacy system
CN112765976A (en) Text similarity calculation method, device and equipment and storage medium
CN112069818A (en) Triple prediction model generation method, relation triple extraction method and device
CN112115362B (en) Programming information recommendation method and device based on similar code recognition
CN115438645A (en) Text data enhancement method and system for sequence labeling task
CN111651987B (en) Identity discrimination method and device, computer readable storage medium and electronic equipment
CN114490993A (en) Small sample intention recognition method, system, equipment and storage medium
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF
CN113591857A (en) Character image processing method and device and ancient Chinese book image identification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20190906
Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Applicant after: China Science and Technology (Beijing) Co., Ltd.
Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601
Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.
CB02 Change of applicant information
Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province
Applicant after: Dingfu Intelligent Technology Co., Ltd
Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing
Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.
GR01 Patent grant