Detailed Description
In a first aspect, referring to fig. 1, an embodiment of the present application provides an address information classification method, where the method includes the following steps:
Step 101: extract all address information to be processed from the text.
The address information to be processed can be extracted from the text by an address information extraction model. Specifically, a Chinese word segmentation system performs word segmentation and part-of-speech tagging, item by item, on a sufficient number of training texts, and a BiLSTM model is then trained on the training texts to generate the address extraction model. A user can extract the address information in a text with this model.
Step 102: determine the integrity type of each piece of address information to be processed, where the integrity type is either positive address information or negative address information; positive address information is complete or partial address information, and negative address information is address information that contains other words.
Step 103: according to the integrity type of each piece of address information to be processed and its position in the text, obtain the address information to be classified corresponding to each piece of address information to be processed by using a forward search algorithm and a backward search algorithm, where the address information to be classified is complete address information.
Combining the forward search algorithm and the backward search algorithm allows the boundary of the address information to be processed to be marked accurately, which improves the accuracy and integrity of subsequent data processing.
Step 104: classify each piece of address information to be classified by using its context information, to obtain the category corresponding to it.
Step 105: output each piece of address information to be classified and its corresponding category.
In this technical solution, the address information in the text is extracted as address information to be processed; the forward search algorithm and the backward search algorithm are applied, according to the integrity type of the address information to be processed and its position in the text, to obtain the complete address information to be classified; and the context information of the address information to be classified is then used to classify it. Therefore, regardless of whether the extracted address information is complete, complete address information is ultimately obtained and accurately classified, which improves the accuracy of the classification result.
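As a rough illustration, steps 101 to 105 could be wired together as shown below. This is a minimal Python sketch under stated assumptions, not the implementation of the present application: the function name classify_addresses and the three callables passed to it (extract, complete, classify) are hypothetical, and the text is assumed to be already split into a list of words.

```python
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive

def classify_addresses(
    words: List[str],
    extract: Callable[[List[str]], List[Span]],      # hypothetical: step 101
    complete: Callable[[List[str], Span], Span],     # hypothetical: steps 102-103
    classify: Callable[[List[str], Span], str],      # hypothetical: step 104
) -> List[Tuple[str, str]]:
    """Return (address, category) pairs for one text (step 105)."""
    results = []
    for span in extract(words):
        full = complete(words, span)
        category = classify(words, full)
        results.append((" ".join(words[full[0]:full[1]]), category))
    return results
```

Each callable can be replaced independently, for example by a trained extraction model for extract and a trained semantic classification model for classify.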
Referring to fig. 2, an address information classification method provided in another embodiment of the present application includes the following steps:
Step 201: extract all address information to be processed from the text.
The address information to be processed can be extracted from the text by an address information extraction model. Specifically, a Chinese word segmentation system performs word segmentation and part-of-speech tagging, item by item, on a sufficient number of training texts, and a BiLSTM model is then trained on the training texts to generate the address extraction model. A user can extract the address information in a text with this model.
Step 202: determine the integrity type of each piece of address information to be processed, where the integrity type is either positive address information or negative address information; positive address information is complete or partial address information, and negative address information is address information that contains other words.
The address information extracted by the address information extraction model may be complete address information in the text, partial address information, or address information that contains other words. For example, suppose the text is "A certain person (household registration: AA Province BB City CC District dd Street e Community x Unit x No. x, identity number: xxxxxxxxxxx) report states in AA Province BB City CC District G Town H Community x Unit x No. x stolen, the door lock is intact, and the safe in the home is pried open" (the words are kept in the order of the original text). If the results extracted by the address information extraction model are "AA Province BB City CC District dd Street e Community x Unit x No. x" and "CC District G Town H Community", then the former is complete address information and is therefore positive address information, and the latter is partial address information and is also positive address information. If an extraction result is "Unit x No. x stolen", it contains the other word "stolen" and is therefore negative address information.
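The distinction between positive and negative address information can be illustrated with a toy rule. The text does not say how the integrity type is decided, so the keyword list and the functions is_address_word and integrity_type below are assumptions made purely for illustration; a trained model could play the same role.

```python
from typing import List

# Hypothetical keyword list; not taken from the source.
ADDRESS_KEYWORDS = ("Province", "City", "District", "Street",
                    "Town", "Community", "Unit", "No.")

def is_address_word(word: str) -> bool:
    """A word counts as an address word if it contains an address keyword."""
    return any(key in word for key in ADDRESS_KEYWORDS)

def integrity_type(words: List[str]) -> str:
    """Positive if every word is an address word, otherwise negative."""
    if words and all(is_address_word(w) for w in words):
        return "positive"
    return "negative"

print(integrity_type(["CC District", "G Town", "H Community"]))  # partial address
print(integrity_type(["Unit x", "No. x", "stolen"]))             # contains another word
```

Under this rule a partial address is still positive, because the rule only checks that no other words are present, not that the address is complete.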
Step 203: if the address information to be processed is positive address information, search from its position in the text in the first direction corresponding to the first search algorithm, and merge the adjacent word with the address information to be processed to obtain merged address information, where the first direction is forward when the first search algorithm is the forward search algorithm, and backward when the first search algorithm is the backward search algorithm.
Step 204: if the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 203, until a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 205: if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, take the merged address information as the address information to be processed, and return to step 203, until that number equals a preset consecutive count or a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 206: take the address information to be processed that was most recently judged to be positive address information as the first target address information.
The preset stop symbol may be set by a user and may be, for example, a comma or a semicolon. Continuing with the example text above, assume that the results extracted by the address information extraction model are "AA Province BB City CC District dd Street e Community x Unit x No. x" and "CC District G Town H Community". The address information to be processed "AA Province BB City CC District dd Street e Community x Unit x No. x" is judged to be positive address information. When the first search algorithm is the forward search algorithm, the search proceeds forward from the position of this address information in the text. Because the search immediately reaches a preset stop symbol, no further search loop is performed, and the first target address information is "AA Province BB City CC District dd Street e Community x Unit x No. x".
The address information to be processed "CC District G Town H Community" is also judged to be positive address information. When the first search algorithm is the forward search algorithm, the search proceeds forward from its position in the text. The adjacent word is "BB City"; merging it with "CC District G Town H Community" gives the merged address information "BB City CC District G Town H Community", which is still positive address information, so the forward search continues. The next adjacent word is "AA Province"; merging gives "AA Province BB City CC District G Town H Community", which is still positive address information, so the forward search continues. The next adjacent word is "in"; merging gives "in AA Province BB City CC District G Town H Community", which is negative address information, so the number of consecutive negative judgements is recorded as 1 and the forward search continues. The next adjacent word is "states"; merging gives "states in AA Province BB City CC District G Town H Community", which is still negative address information, so the number is recorded as 2 and the forward search continues. The next adjacent word is "report"; merging gives "report states in AA Province BB City CC District G Town H Community", which is still negative address information, so the number is recorded as 3. If the preset consecutive count is 3, the forward search stops, and "AA Province BB City CC District G Town H Community", the address information most recently judged to be positive address information, is determined as the first target address information.
When the first search algorithm is the backward search algorithm, only the search direction differs from the example above; the rest is the same and is not repeated here.
Step 207: search from the position of the first target address information in the text in the second direction corresponding to the second search algorithm, and merge the adjacent word with the first target address information to obtain merged address information, where the second direction is forward when the second search algorithm is the forward search algorithm, and backward when the second search algorithm is the backward search algorithm.
Step 208: if the merged address information is positive address information, take the merged address information as the first target address information and return to step 207, until a preset stop symbol adjacent to the first target address information is found in the second direction.
Step 209: if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, take the merged address information as the first target address information, and return to step 207, until that number equals the preset consecutive count or a preset stop symbol adjacent to the first target address information is found in the second direction.
Step 210: take the first target address information that was most recently judged to be positive address information as the address information to be classified.
Continuing with the above example, the first target address information is "AA Province BB City CC District dd Street e Community x Unit x No. x" and "AA Province BB City CC District G Town H Community". The first target address information "AA Province BB City CC District dd Street e Community x Unit x No. x" is judged to be positive address information. When the second search algorithm is the backward search algorithm, the search proceeds backward from its position in the text. The adjacent symbol is a comma, so no further search loop is performed, and the address information to be classified is "AA Province BB City CC District dd Street e Community x Unit x No. x".
The first target address information "AA Province BB City CC District G Town H Community" is also judged to be positive address information. When the second search algorithm is the backward search algorithm, the search proceeds backward from its position in the text. The adjacent word is "x"; merging gives "AA Province BB City CC District G Town H Community x", which is still positive address information, so the backward search continues. The next adjacent word is "Unit x"; merging gives "AA Province BB City CC District G Town H Community x Unit x", which is still positive address information, so the backward search continues. The next adjacent word is "No. x"; merging gives "AA Province BB City CC District G Town H Community x Unit x No. x", which is still positive address information, so the backward search continues. The next adjacent word is "stolen"; merging gives "AA Province BB City CC District G Town H Community x Unit x No. x stolen", which is negative address information, so the number of consecutive negative judgements is recorded as 1 and the backward search continues. The next adjacent symbol is a comma, so the backward search stops, and "AA Province BB City CC District G Town H Community x Unit x No. x" is determined as the address information to be classified.
When the second search algorithm is the forward search algorithm, only the search direction differs from the example above; the rest is the same and is not repeated here.
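The two search passes of steps 203 to 210 can be sketched as one expansion routine that is run twice. The Python code below is a hedged illustration, not the implementation of the present application. Its assumptions are: the text is already split into words; "forward" means towards the start of the text, as in the worked example; is_positive is a hypothetical keyword rule standing in for the integrity type judgement; and the counter of consecutive negative judgements is reset whenever a merged span is judged positive. The word list is a simplified version of the example text.

```python
from typing import List, Tuple

STOP_SYMBOLS = {",", ";"}   # preset stop symbols mentioned in the text
MAX_NEGATIVE = 3            # preset consecutive count from the worked example
# Hypothetical keyword rule; the source does not specify the judgement.
KEYWORDS = ("Province", "City", "District", "Town", "Community", "Unit", "No.")

def is_positive(words: List[str]) -> bool:
    return all(any(k in w for k in KEYWORDS) for w in words)

def expand(words: List[str], span: Tuple[int, int], forward: bool) -> Tuple[int, int]:
    """One search pass; returns the last span that was judged positive."""
    start, end = span
    best, negatives = span, 0
    while negatives < MAX_NEGATIVE:
        nxt = start - 1 if forward else end
        if nxt < 0 or nxt >= len(words) or words[nxt] in STOP_SYMBOLS:
            break                                  # text boundary or stop symbol
        start, end = (nxt, end) if forward else (start, nxt + 1)
        if is_positive(words[start:end]):
            best, negatives = (start, end), 0      # remember last positive span
        else:
            negatives += 1                         # keep the merged span and go on
    return best

def complete(words: List[str], span: Tuple[int, int]) -> Tuple[int, int]:
    first_target = expand(words, span, forward=True)     # steps 203 to 206
    return expand(words, first_target, forward=False)    # steps 207 to 210

words = ["report", "states", "in", "AA Province", "BB City", "CC District",
         "G Town", "H Community", "Unit x", "No. x", "stolen", ",", "the", "door"]
start, end = complete(words, (5, 8))   # (5, 8) covers "CC District G Town H Community"
print(" ".join(words[start:end]))
```

Swapping the two forward arguments in complete gives the variant in which the first search algorithm is the backward search algorithm.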
Step 211: if the address information to be processed is negative address information, perform word segmentation on the address information to be processed to obtain a plurality of word segments.
Assuming that the address information to be processed extracted in the above example includes "Unit x No. x stolen", word segmentation is performed on it because it is negative address information, yielding "Unit x", "No. x", and "stolen".
Step 212: extract any address word segment from the word segments and take it as the address information to be processed.
Because the address word segments in the segmentation result are "Unit x" and "No. x", either of them can be extracted as the address information to be processed.
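Steps 211 and 212 can be sketched as follows. The code assumes the negative address information has already been segmented into words, uses a hypothetical keyword rule to recognise address word segments, and picks the first one, although the text allows any address word segment to be chosen. The name pick_address_word is illustrative only.

```python
from typing import List, Optional, Tuple

# Hypothetical keyword list used to recognise address word segments.
KEYWORDS = ("Province", "City", "District", "Town", "Community", "Unit", "No.")

def pick_address_word(words: List[str], span: Tuple[int, int]) -> Optional[Tuple[int, int]]:
    """Return the span of the first address word segment inside a negative candidate."""
    start, end = span
    for i in range(start, end):
        if any(key in words[i] for key in KEYWORDS):
            return (i, i + 1)
    return None  # no address word segment found in the candidate

words = ["H Community", "Unit x", "No. x", "stolen", ","]
print(pick_address_word(words, (1, 4)))  # candidate is "Unit x No. x stolen"
```

The returned single-word span then serves as the address information to be processed for the search in the first direction.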
Step 213: search from the position of the address information to be processed in the text in the first direction corresponding to the first search algorithm, and merge the adjacent word with the address information to be processed to obtain merged address information, where the first direction is forward when the first search algorithm is the forward search algorithm, and backward when the first search algorithm is the backward search algorithm.
Step 214: if the merged address information is positive address information, take the merged address information as the address information to be processed and return to step 213, until a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 215: if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, take the merged address information as the address information to be processed, and return to step 213, until that number equals the preset consecutive count or a preset stop symbol adjacent to the address information to be processed is found in the first direction.
Step 216: take the address information to be processed that was most recently judged to be positive address information as the first target address information.
Step 217: search from the position of the first target address information in the text in the second direction corresponding to the second search algorithm, and merge the adjacent word with the first target address information to obtain merged address information, where the second direction is forward when the second search algorithm is the forward search algorithm, and backward when the second search algorithm is the backward search algorithm.
Step 218: if the merged address information is positive address information, take the merged address information as the first target address information and return to step 217, until a preset stop symbol adjacent to the first target address information is found in the second direction.
Step 219: if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, take the merged address information as the first target address information, and return to step 217, until that number equals the preset consecutive count or a preset stop symbol adjacent to the first target address information is found in the second direction.
Step 220: take the first target address information that was most recently judged to be positive address information as the address information to be classified.
The processing in steps 213 to 220 is the same as that in steps 203 to 210 and is not described again here. By combining the forward search algorithm and the backward search algorithm, the boundary of the complete address information can be delimited so that it contains no other words, which improves the accuracy and integrity of the results of subsequent processing.
Step 221: acquire the context information of each piece of address information to be classified, to obtain the target text information to which each piece of address information to be classified belongs.
The context information of a piece of address information to be classified is a preset number of words before and after its position in the text. If these words include a preset punctuation mark such as a comma, a period, or a semicolon, only the words between the punctuation marks are kept, which yields the target text information containing the address information to be classified. For example, suppose the address information to be classified is "AA Province BB City CC District dd Street e Community x Unit x No. x" and the preset number of words in each direction is 3. Because the address information is followed by a comma and preceded by only two words, the target text information to which it belongs is "A certain person (household registration: AA Province BB City CC District dd Street e Community x Unit x No. x".
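The context window described above can be sketched in Python as follows. The function name target_text is hypothetical, and the text is assumed to be already split into words, with punctuation marks as separate items. The window grows by at most the preset number of words on each side and stops early at a preset punctuation mark or at the edge of the text.

```python
from typing import List, Tuple

PUNCTUATION = {",", ".", ";"}   # preset punctuation marks named in the text

def target_text(words: List[str], span: Tuple[int, int], window: int = 3) -> List[str]:
    """Return the address plus up to `window` words on each side."""
    start, end = span
    left = start
    while left > 0 and start - left < window and words[left - 1] not in PUNCTUATION:
        left -= 1
    right = end
    while right < len(words) and right - end < window and words[right] not in PUNCTUATION:
        right += 1
    return words[left:right]

words = ["A certain person", "(household registration:", "AA Province", "BB City",
         "CC District", ",", "identity number:", "xxxxxxxxxxx"]
print(target_text(words, (2, 5)))  # address is followed by a comma, two words precede it
```

Here the address words are shortened for readability; the span (2, 5) marks the address inside the word list.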
Step 222: replace the address information to be classified in each piece of target text information with preset characters.
The preset characters are not limited in the embodiments of the present application and may be letters or digits. For example, when the address information to be classified in "A certain person (household registration: AA Province BB City CC District dd Street e Community x Unit x No. x" is replaced with the character string "aaaaaa", the result is "A certain person (household registration: aaaaaa". Replacing the address information to be classified with preset characters prevents it from interfering with the subsequent semantic analysis and improves classification accuracy.
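A minimal Python sketch of this replacement is given below. The function names mask_address and unmask_address are hypothetical, and the sketch assumes that the placeholder string does not otherwise occur in the target text information, so that it can be swapped back after classification.

```python
PLACEHOLDER = "aaaaaa"  # preset characters from the example; letters or digits both work

def mask_address(target_text: str, address: str) -> str:
    """Replace the first occurrence of the address with the placeholder."""
    return target_text.replace(address, PLACEHOLDER, 1)

def unmask_address(masked_text: str, address: str) -> str:
    """Put the original address back once the category is known."""
    return masked_text.replace(PLACEHOLDER, address, 1)

address = "AA Province BB City CC District dd Street e Community x Unit x No. x"
text = "A certain person (household registration: " + address
masked = mask_address(text, address)
print(masked)
```

Only the masked text is passed to the semantic classification model, so the words of the address itself do not influence the predicted category.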
Step 223: classify the address information to be classified in each piece of replaced target text information with a semantic classification model, according to the semantics of that replaced target text information, to obtain the category of each piece of address information to be classified.
The semantic classification model is obtained by training TextCNN on training samples. TextCNN achieves high accuracy when applied to Chinese text processing. It is commonly used for single-label classification, in which a convolution layer, a pooling layer, and a fully connected layer are followed by a Softmax layer. The Softmax layer outputs a probability distribution over the categories, and the category with the highest probability is the final output of the classification model. The single-label classification model can even reach 97% accuracy in the business scenario.
The semantic classification model performs semantic analysis on the replaced target text information and then classifies it, yielding the category of the address information to be classified. For example, semantic analysis and classification of the replaced target text information "A certain person (household registration: aaaaaa" determine that "aaaaaa" is a household registration address. "aaaaaa" is then converted back into the corresponding address information to be classified, and the final result is: household registration address, "AA Province BB City CC District dd Street e Community x Unit x No. x".
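The final selection step, in which the Softmax output is turned into a category and the placeholder is mapped back to the address, can be sketched in plain Python. The scores below are invented stand-ins for the output of a trained model, and the label "incident address" is a hypothetical second category that does not appear in the text.

```python
import math
from typing import List, Tuple

def softmax(scores: List[float]) -> List[float]:
    """Turn raw category scores into a probability distribution."""
    peak = max(scores)                         # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_category(scores: List[float], labels: List[str]) -> Tuple[str, float]:
    """Return the label with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return labels[best], probs[best]

# Hypothetical labels and scores; a trained model would supply the scores.
labels = ["household registration address", "incident address"]
category, prob = pick_category([2.0, 0.5], labels)
address = "AA Province BB City CC District dd Street e Community x Unit x No. x"
print((category, address))
```

The pair of category and restored address is what step 224 outputs for each piece of address information to be classified.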
Step 224: output each piece of address information to be classified and its corresponding category.
In this technical solution, the address information in the text is extracted as address information to be processed; the forward search algorithm and the backward search algorithm are applied, according to the integrity type of the address information to be processed and its position in the text, to obtain the complete address information to be classified; and the context information of the address information to be classified is then used to classify it. Therefore, regardless of whether the extracted address information is complete, complete address information is ultimately obtained and accurately classified, which improves the accuracy of the classification result.
In a second aspect, referring to fig. 3, the present application provides an address information classifying apparatus, including:
an extraction module 301, configured to extract all address information to be processed in the text;
a determining module 302, configured to determine, according to each piece of to-be-processed address information, an integrity type of each piece of to-be-processed address information, where the integrity type of each piece of to-be-processed address information includes positive address information and negative address information, the positive address information includes complete or partial address information, and the negative address information includes address information including other words;
a to-be-classified address determining module 303, configured to obtain, according to the integrity type of each piece of to-be-processed address information and the position of the to-be-processed address information in the text, to-be-classified address information corresponding to each piece of to-be-processed address information by using a forward search algorithm and a backward search algorithm, where the to-be-classified address information is complete address information;
a classification module 304, configured to classify each piece of address information to be classified by using context information of each piece of address information to be classified, so as to obtain a category corresponding to each piece of address information to be classified;
an output module 305, configured to output each address information to be classified and a corresponding category.
Further, referring to fig. 4, the address to be classified determining module 303 includes:
a first search algorithm unit 401, configured to, if the address information to be processed is positive address information, obtain first target address information by using a first search algorithm from a position of the address information to be processed in the text, where the first search algorithm is a forward search algorithm or a backward search algorithm;
a second search algorithm unit 402, configured to obtain address information to be classified by using a second search algorithm from a position of the first target address information in the text, where the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
Further, referring to fig. 5, the address to be classified determining module 303 further includes:
a word segmentation unit 501, configured to perform word segmentation on the address information to be processed to obtain multiple word segments if the address information to be processed is negative address information;
an extracting unit 502, configured to extract any address word segment from the multiple word segments and determine the address word segment as address information to be processed;
the first search algorithm unit 401 is further configured to obtain first target address information by using a first search algorithm from the position of the address information to be processed in the text, where the first search algorithm is a forward search algorithm or a backward search algorithm;
the second search algorithm unit 402 is further configured to obtain address information to be classified by using a second search algorithm from the position of the first target address information in the text, where the address information to be classified is complete address information, and when the first search algorithm is a forward search algorithm, the second search algorithm is a backward search algorithm; when the first search algorithm is a backward search algorithm, the second search algorithm is a forward search algorithm.
Further, referring to fig. 6, the first search algorithm unit 401 includes:
a first direction search subunit 601, configured to start a search from the position of the address information to be processed in the text to a first direction corresponding to the first search algorithm, and merge an adjacent word with the address information to be processed to obtain merged address information, where when the first search algorithm is a forward search algorithm, the first direction is a forward direction; when the first search algorithm is a backward search algorithm, the first direction is a backward direction;
a loop judgment subunit 602, configured to: if the merged address information is positive address information, determine the merged address information as address information to be processed, and repeat the step of searching in the first direction until a preset stop symbol adjacent to the address information to be processed is found in the first direction; and if the merged address information is negative address information, record the number of consecutive times the merged address information has been judged to be negative address information, determine the merged address information as address information to be processed, and repeat the step of searching in the first direction until that number equals the preset consecutive count or a preset stop symbol adjacent to the address information to be processed is found in the first direction;
a determining subunit 603, configured to determine, as the first target address information, the address information to be processed that was most recently judged to be positive address information.
In this technical solution, the address information in the text is extracted as address information to be processed; the forward search algorithm and the backward search algorithm are applied, according to the integrity type of the address information to be processed and its position in the text, to obtain the complete address information to be classified; and the context information of the address information to be classified is then used to classify it. Therefore, regardless of whether the extracted address information is complete, complete address information is ultimately obtained and accurately classified, which improves the accuracy of the classification result.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions in the embodiments of the present application may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the embodiments, or in some parts of the embodiments, of the present application.
The embodiments of the present application are described in a progressive manner. The same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief, and the relevant parts may be found in the description of the method embodiment.