CN107153652B

CN107153652B - Method and device for converting target character string into normalized character string

Info

Publication number: CN107153652B
Application number: CN201610121395.XA
Authority: CN
Inventors: 赵科科
Original assignee: Advanced New Technologies Co Ltd
Current assignee: Advanced New Technologies Co Ltd; Advantageous New Technologies Co Ltd
Priority date: 2016-03-03
Filing date: 2016-03-03
Publication date: 2020-10-30
Anticipated expiration: 2036-03-03
Also published as: CN107153652A

Abstract

The application provides a method and a device for converting a target character string into a normalized character string, wherein the method comprises the following steps: traversing and segmenting the target character string based on the specified segmentation length to obtain a first segmentation unit set; searching the normalized character strings associated with all the segmentation units in the first segmentation unit set in the index list to obtain a normalized character string set corresponding to the first segmentation unit set; the index list comprises a segmentation unit obtained by traversing and segmenting the normalized character string and the normalized character string associated with the segmentation unit; calculating the similarity of the target character string and each character string in the normalized character string set; searching a normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity; and converting the target character string into the searched normalized character string. According to the method and the device, the calculated amount can be reduced when the normalized character string corresponding to the target character string is searched, and the searching efficiency of the normalized character string is improved.

Description

Method and device for converting target character string into normalized character string

Technical Field

The present application relates to the field of communications, and in particular, to a method and an apparatus for converting a target character string into a normalized character string.

Background

On an internet platform, short texts input by users, such as addresses and company names, may differ in wording from the normalized addresses and company names while still referring to the same entity. For example, suppose that the company name collected from a resume that a user has stored on the internet platform is "paypal company", while the standardized full name of that company is actually "paypal network technology limited company"; although the two names differ, they denote the same company. Normalizing the short texts provided by users is therefore very important for subsequent data processing and use.

Disclosure of Invention

The application provides a method for converting a target character string into a normalized character string, which comprises the following steps:

traversing and segmenting the target character string based on the specified segmentation length to obtain a first segmentation unit set;

searching a normalized character string associated with each segmentation unit in the first segmentation unit set in a preset index list to obtain a normalized character string set corresponding to the first segmentation unit set; the index list comprises a segmentation unit obtained by traversing and segmenting the normalized character string and the normalized character string associated with the segmentation unit;

calculating the similarity between the target character string and each character string in the normalized character string set;

searching a normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity;

and converting the target character string into the searched normalized character string.

Optionally, the preset index list is generated as follows:

traversing and segmenting the normalized character strings stored in the database based on the specified segmentation length to obtain a second segmentation unit set;

generating corresponding index table entries for all the segmentation units in the second segmentation unit set respectively; the information recorded by the index table entry comprises each segmentation unit of the normalized character string and the normalized character string associated with each segmentation unit in the database.

Optionally, the information recorded in the index table entry further includes the number of normalized character strings associated with each segmentation unit in the database;

the method further comprises the following steps:

sorting the index entries in the index list based on the number;

and deleting from the index list the index table entries whose recorded number is larger than a preset first threshold value.

Optionally, the method further includes:

and when the index table entries corresponding to all the segmentation units of any normalized character string have been deleted from the index list, adding back to the index list the index table entry that records the smallest number among the index table entries corresponding to those segmentation units.

Optionally, the searching for the normalized character string associated with each segmentation unit in the first segmentation unit set in the preset index list to obtain the normalized character string set corresponding to the first segmentation unit set includes:

matching the segmentation units in the first segmentation unit set in the index list;

acquiring index table entries matched with the segmentation units in the first segmentation unit set in the index list;

extracting the normalized character string recorded in the acquired index table entry to generate a normalized character string set; or,

extracting the normalized character strings recorded in those acquired index table entries whose recorded number is lower than a preset second threshold value, and generating a normalized character string set; wherein the second threshold is less than the first threshold.

Optionally, the calculating the similarity between the target character string and each character string in the normalized character string set includes:

calculating the editing distance between the target character string and each character string in the normalized character string set;

calculating the maximum value and the minimum value in the lengths of the target character string and each character string;

calculating the ratio of the maximum value minus the edit distance to the minimum value, and representing the similarity between the target character string and each character string based on the ratio; or

calculating the ratio of the maximum value to the sum of the minimum value and the edit distance, and representing the similarity between the target character string and each character string based on the ratio.

Optionally, the calculating the similarity between the target character string and the obtained normalized character string based on the edit distance includes:

calculating the similarity between the target character string and each character string in the normalized character string set based on a preset similarity calculation formula;

the similarity calculation formula includes:

or

Wherein x represents the target character string and |x| represents its length; y represents a normalized character string and |y| represents its length; max(|x|, |y|) represents the larger of the lengths of the target character string and the normalized character string; min(|x|, |y|) represents the smaller of the two lengths; ds represents the edit distance between the target character string and the normalized character string; and c represents a correction parameter, which is a constant equal to or greater than 0.
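The two ratios described in words above can be sketched as follows. This is only an illustrative reading: the formula images are not reproduced in this text, so the position of the correction parameter c is unknown and c is omitted here. The function names `edit_distance`, `similarity_subtract` and `similarity_add` are hypothetical, and the Levenshtein distance is assumed as the edit distance.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance (assumed meaning of 'edit distance'), row by row."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # delete a character of x
                            curr[j - 1] + 1,      # insert a character of y
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_subtract(x: str, y: str) -> float:
    """(max(|x|, |y|) - ds) / min(|x|, |y|); correction parameter c omitted."""
    longest, shortest = max(len(x), len(y)), min(len(x), len(y))
    if shortest == 0:
        # Assumption: two empty strings are identical, otherwise no similarity.
        return 1.0 if longest == 0 else 0.0
    return (longest - edit_distance(x, y)) / shortest


def similarity_add(x: str, y: str) -> float:
    """max(|x|, |y|) / (min(|x|, |y|) + ds); correction parameter c omitted."""
    longest, shortest = max(len(x), len(y)), min(len(x), len(y))
    denominator = shortest + edit_distance(x, y)
    if denominator == 0:
        return 1.0  # both strings are empty
    return longest / denominator
```

Because the edit distance is never smaller than the difference between the two lengths, both ratios stay at or below 1, and both reach 1 when the shorter string can be obtained from the longer one purely by deleting characters, as with an abbreviated company name.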

Optionally, the searching for the normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity includes:

judging whether the calculated similarity between the target character string and each character string in the normalized character string set reaches a preset similarity threshold value or not;

and when the similarity between the target character string and any character string in the normalized character string set reaches a preset similarity threshold value, determining that the character string is a normalized character string corresponding to the target character string.
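The threshold check above might look like the following sketch. The function name `pick_normalized` and the dictionary of precomputed similarities are hypothetical; returning the highest-scoring string when several reach the threshold is an assumption, since the text does not say how such ties are resolved.

```python
from typing import Dict, Optional


def pick_normalized(similarities: Dict[str, float], threshold: float) -> Optional[str]:
    """Return a normalized string whose similarity reaches the threshold, if any."""
    # Keep only the candidates that reach the preset similarity threshold.
    qualifying = {s: v for s, v in similarities.items() if v >= threshold}
    if not qualifying:
        return None  # no normalized string corresponds to the target
    # Assumption: prefer the most similar candidate among those that qualify.
    return max(qualifying, key=qualifying.get)
```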

The present application further proposes a device for converting a target character string into a normalized character string, the device comprising:

the first segmentation module is used for performing traversal segmentation on the target character string based on the specified segmentation length to obtain a first segmentation unit set;

the acquisition module is used for searching the normalized character strings associated with the segmentation units in the first segmentation unit set in a preset index list to obtain a normalized character string set corresponding to the first segmentation unit set; the index list comprises a segmentation unit obtained by traversing and segmenting the normalized character string and the normalized character string associated with the segmentation unit;

the calculation module is used for calculating the similarity between the target character string and each character string in the normalized character string set;

the searching module is used for searching the normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity;

and the conversion module is used for converting the target character string into the searched normalized character string.

Optionally, the apparatus further comprises:

the second segmentation module is used for performing traversal segmentation on the normalized character strings stored in the database based on the specified segmentation length to obtain a second segmentation unit set;

a generating module, configured to generate corresponding index table entries for each segmentation unit in the second segmentation unit set; the information recorded by the index table entry comprises each segmentation unit of the normalized character string and the normalized character string associated with each segmentation unit in the database.

the device further comprises:

a sorting module, configured to sort the index entries in the index list based on the number;

and the deleting module is used for deleting from the index list the index table entries whose recorded number is larger than a preset first threshold value.

Optionally, the apparatus further comprises:

and the adding module is used for, when the index table entries corresponding to all the segmentation units of any normalized character string have been deleted from the index list, adding back to the index list the index table entry that records the smallest number among the index table entries corresponding to those segmentation units.

Optionally, the obtaining module is specifically configured to:

Optionally, the calculation module is specifically configured to:

Optionally, the calculation module is further configured to:

the similarity calculation formula includes:

or

Optionally, the search module is specifically configured to:

In the present application, when a target character string is converted into a normalized character string, the target character string is segmented based on a specified segmentation length to obtain a first segmentation unit set, and the normalized character strings associated with the segmentation units in the first segmentation unit set are looked up in a preset index list to obtain a normalized character string set corresponding to the first segmentation unit set. The similarity between the target character string and each character string in the normalized character string set is then calculated based on the edit distance, and the normalized character string corresponding to the target character string is searched for in the normalized character string set based on the calculated similarity. Because the index list quickly narrows the search to the normalized character strings associated with the target character string, the similarity no longer has to be calculated one by one against every normalized character string stored in the database, which reduces the amount of calculation and improves the efficiency of searching for the normalized character string.

Drawings

FIG. 1 is a flow chart of a method for converting a target string into a normalized string according to an embodiment of the present application;

FIG. 2 is a logic block diagram of an apparatus for converting a target string into a normalized string according to an embodiment of the present application;

FIG. 3 is a hardware structure diagram of a server that carries the apparatus for converting the target character string into the normalized character string according to an embodiment of the present application.

Detailed Description

In the related art, when the short text such as an address and a company name input by a user is normalized, a database storing the normalized text is usually provided, similarity between the short text provided by the user and each normalized text in the database is calculated, and then the normalized text corresponding to the short text provided by the user is searched in the database based on the calculated similarity.

However, in this way, the similarity between the short text provided by the user and each normalized text in the database needs to be calculated one by one, and the calculation amount is large, so that the efficiency of searching the normalized text is not high.

In view of this, the present application provides a method for converting a target character string into a normalized character string. When the target character string is converted, it is segmented based on a specified segmentation length to obtain a first segmentation unit set, and the normalized character strings associated with the segmentation units in the first segmentation unit set are looked up in a preset index list to obtain a normalized character string set corresponding to the first segmentation unit set. The similarity between the target character string and each character string in the normalized character string set is then calculated based on the edit distance, and the normalized character string corresponding to the target character string is searched for in the normalized character string set based on the calculated similarity. Because the index list quickly narrows the search to the normalized character strings associated with the target character string, the similarity no longer has to be calculated one by one against every normalized character string stored in the database, which reduces the amount of calculation and improves the efficiency of searching for the normalized character string.

The present application is described below with reference to specific embodiments and specific application scenarios.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for converting a target string into a normalized string, applied to a server, according to an embodiment of the present application, where the method performs the following steps:

Step 101, performing traversal segmentation on a target character string based on a specified segmentation length to obtain a first segmentation unit set;

The server side may comprise a server, a server cluster, or a cloud platform built on a server cluster; the target character string may include irregular short text input by a user.

For example, taking an application scenario of electronic commerce as an example, the server may be a cloud platform of an electronic commerce provider, the target character string may include an address and a company name input by a user, and the cloud platform may collect short texts input by the user, such as the address and the company name, and perform normalization processing on the short texts input by the user at a cloud end.

In this example, when the server collects a target character string input by a user, traversal segmentation can be performed on the target character string based on a specified segmentation length to obtain a plurality of segmentation units.

The specified segmentation length can be set by the user according to actual service requirements. Traversal segmentation means traversing the character string one character at a time in character order: after a segmentation unit of the specified length has been cut out, the segmentation position moves one character to the right and segmentation continues until the end of the character string is reached, so that the resulting segmentation units cover every run of consecutive characters of that length in the character string.

For example, taking the target character string to be the company name short text "paypal company" input by the user, and assuming that the segmentation length set by the user is 3, traversal segmentation of "paypal company" based on that segmentation length yields three segmentation units: "paypal", "paypal public" and "treasury company".

After the server performs traversal segmentation on the target character string, the server may generate a first segmentation unit set based on all segmentation units obtained by the segmentation.

The first segmentation unit set comprises segmentation units obtained after character segmentation is carried out on the target character string, and the number of the segmentation units stored in the first segmentation unit set is consistent with the number of the segmentation units obtained by character segmentation of the target character string by the server.
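Traversal segmentation as described in step 101 amounts to sliding a fixed-length window over the string one character at a time. A minimal sketch follows; the function name `traverse_segment` is hypothetical, and returning no units for a string shorter than the segmentation length is an assumption the text does not address.

```python
from typing import List


def traverse_segment(text: str, length: int) -> List[str]:
    """Cut `text` into overlapping units of `length` characters, moving
    the cut position one character to the right after each unit."""
    if length <= 0:
        raise ValueError("segmentation length must be positive")
    # Assumption: a string shorter than `length` yields an empty unit set.
    return [text[i:i + length] for i in range(len(text) - length + 1)]
```

A five-character string segmented with length 3 yields three units, which matches the count in the company name example above.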

Step 102, searching a normalized character string associated with each segmentation unit in the first segmentation unit set in a preset index list to obtain a normalized character string set corresponding to the first segmentation unit set; the index list comprises a segmentation unit obtained by traversing and segmenting the normalized character string and the normalized character string associated with the segmentation unit;

In this example, the server may provide in advance a local database storing the normalized character strings. For example, if the target character string is a company name input by the user, the database may be a yellow pages database storing normalized company names (i.e., the full names of the companies).

In an initial state, the server may also perform traversal segmentation on the normalized character strings stored in the database based on the specified segmentation length to obtain a plurality of segmentation units.

When the server performs traversal segmentation on the normalized character strings stored in the database, the segmentation length and the specific segmentation manner are kept the same as those used when the server performs traversal segmentation on the target character string.

After the server performs traversal segmentation on the normalized character strings stored in the database, the server may generate a second segmentation unit set based on the segmentation units obtained by the segmentation.

The second segmentation unit set comprises segmentation units obtained after traversal segmentation is carried out on the normalized character strings stored in the database.

For example, taking the database as a yellow page database as an example, the server may perform traversal segmentation on normalized company names stored in the database one by one, and then generate the second segmentation unit set based on the segmentation units obtained after traversal segmentation.

After the server finishes segmenting the normalized character strings stored in the database, the server may count the normalized character strings associated with each segmentation unit in the second segmentation unit set in the database and the number of the associated normalized character strings.

When the server side counts the associated normalized character strings in the database, the server side can traverse the second segmentation unit set, read the segmentation units from the second segmentation unit set in sequence, and then match the read segmentation units with the normalized character strings stored in the database in sequence.

When the read segmentation unit is matched in any normalized character string in the database, the server side can consider that the segmentation unit and the normalized character string have an association relationship. After all the read segmentation units are matched with all the normalized character strings stored in the database, the server side can count the normalized character strings which are determined by matching and have the association relation with the segmentation units and the specific number of the normalized character strings which have the association relation with the segmentation units.

For example, assuming that the database is a yellow pages database, and the database stores the standardized company name "paypal network technology limited company" of paypal company, the server performs character segmentation on the standardized company name according to a specified segmentation length of 3, and then 8 segmentation units such as "paypal", "paypal network", "treasure network", "network technology", "technology available", "limited public", and "limited company" can be obtained.

When the server determines the normalized company names associated with the 8 segmentation units in the database, the server can read the 8 segmentation units in sequence and match each of them against the normalized company names stored in the database one by one. For example, assume that the segmentation unit read by the server is "network technology" and that the normalized company name "aluba network technology limited company" is currently stored in the database. When the server matches the segmentation unit "network technology" against this normalized company name, the segmentation unit is found in the company name because the company name contains "network technology"; in this case, the server can determine that the segmentation unit "network technology" has an association relationship with "aluba network technology limited company". After the segmentation unit "network technology" has been matched against all the normalized company names in the database, the server can record the IDs of the normalized company names associated with the segmentation unit in the database, and count the number of normalized company names associated with the segmentation unit.

In this example, after the server has counted, in the manner described above, the normalized character strings associated in the database with each segmentation unit in the second segmentation unit set and the number of those normalized character strings, the server may generate a corresponding index table entry for each segmentation unit in the second segmentation unit set based on the counted information, and then create an index list based on the generated index table entries.

The index table entry is used for quickly looking up a segmentation unit. The basic structure of the index table entry may include three fields: a segmentation unit field, an associated normalized character string field, and an associated normalized character string number field.

The segmentation unit field records a segmentation unit; the associated normalized character string field records the counted normalized character strings that have an association relationship with that segmentation unit; and the associated normalized character string number field records the counted number of normalized character strings that have an association relationship with that segmentation unit.

For example, taking the character strings to be company names and the database to be a yellow pages database, assume that the segmentation unit "paypal" is associated in the database only with the normalized company name "paypal network technology limited company", and that the ID of this normalized company name is 120. The index table entry generated for the segmentation unit may then be as follows:

Segmentation unit	Associated company name (ID)	Number of associated company names
paypal	120	1

After the server generates corresponding index entries for each of the segmentation units in the second segmentation unit set based on the basic structure, the server may create an index list based on the index entries generated for all the segmentation units in the second segmentation unit set. The index list is composed of index table entries generated for all the segmentation units in the second segmentation unit set.
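The index construction described above can be sketched as an inverted index from segmentation units to the IDs of the normalized strings that contain them. The names `build_index`, `database`, `ids` and `count` are hypothetical, as is the use of a dictionary for the index list. The text describes matching each unit against every stored string; this sketch instead accumulates the units string by string, which yields the same associations because a unit of the segmentation length occurs in a string exactly when it is one of that string's own segmentation units.

```python
from collections import defaultdict
from typing import Dict, List


def traverse_segment(text: str, length: int) -> List[str]:
    """Overlapping units of `length` characters, one-character step."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def build_index(database: Dict[int, str], length: int) -> Dict[str, dict]:
    """Map each segmentation unit to the IDs of the normalized strings
    containing it and to the number of such strings."""
    postings = defaultdict(set)
    for string_id, normalized in database.items():
        for unit in traverse_segment(normalized, length):
            postings[unit].add(string_id)
    # One entry per unit: the unit (key), associated IDs, and their number.
    return {unit: {"ids": sorted(ids), "count": len(ids)}
            for unit, ids in postings.items()}
```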

In this example, once the index list has been created, whenever the server subsequently segments a target character string to generate the first segmentation unit set, the server may traverse the first segmentation unit set, sequentially match its segmentation units against the segmentation units in the index table entries of the index list, extract the normalized character strings recorded in the matched index table entries, and generate a corresponding normalized character string set for the target character string. The normalized character strings in this set form the finally determined search range within which the normalized character string corresponding to the target character string is searched for.

It should be noted that, in practical applications, the number of index table entries in the index list may be very large (for example, the normalized company names stored in a yellow pages database may number in the tens of millions). In this case, if the server matches the segmentation units in the first segmentation unit set one by one against the segmentation units in the index table entries of the index list, the matching may take a long time.

Therefore, in order to reduce the matching time length and avoid matching the segmentation units in the first segmentation unit set with the segmentation units in the index table items one by one, after the creation of the index list is completed, the server can also control the number of the index table items in the index list and delete part of the index table items in the index list.

In one illustrated embodiment, the server may preset a first threshold, whose specific value may be set according to actual requirements, for example 1000.

After the index list is created, the server may sort the index table entries by the "number of associated normalized character strings" field, in either descending or ascending order; the order is not particularly limited in practical applications. Once the sorting is complete, the server may delete, based on the sorted order, the index table entries in which the recorded number of associated normalized character strings is greater than the first threshold.

In this way, the number of index table entries in the index list can be effectively controlled. When the server matches the segmentation units in the first segmentation unit set against the index list, it only has to match against the remaining index table entries, in which the number of associated normalized character strings is not greater than the first threshold, so the matching time can be shortened as much as possible without affecting the matching result.

For example, assume that the database is a yellow pages database, the target character string input by the user is "paypal company", and the first threshold is 1000. After the server segments the normalized company name "paypal network technology limited company" stored in the database, the counted numbers of company names associated with the segmentation units are as follows:

"paypal": 1; "paypal network": 1; "treasure network": 5; and the six remaining segmentation units (among them "network technology", "technology with", "technology limited", "limited public" and "limited company"): 1819, 2120, 6851, 7654, 9315 and 9968, in order.

The server can then delete from the index list the index table entries of the segmentation units whose number of associated company names exceeds 1000, such as "network technology", "technology with", "technology limited", "limited public" and "limited company", and retain only the index table entries of the segmentation units "paypal", "paypal network" and "treasure network".

When the server reads a segmentation unit from the first segmentation unit set and matches it in the index list, the index list retains only the index table entries of the segmentation units "paypal", "paypal network" and "treasure network", but the normalized company names associated with these segmentation units still include "paypal network technology limited company". The normalized character string set finally determined by the server through the matching process therefore still includes the normalized company name "paypal network technology limited company". The matching speed is thus improved without affecting the accuracy of the matching result.
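The deletion by first threshold can be sketched as a simple filter over the index. The name `prune_index` and the entry layout with a `count` key are hypothetical. The text sorts the entries by count before deleting; a filter gives the same surviving entries without sorting, so the sort is skipped in this sketch.

```python
from typing import Dict


def prune_index(index: Dict[str, dict], first_threshold: int) -> Dict[str, dict]:
    """Drop every entry whose count of associated normalized strings
    is greater than `first_threshold`; keep the rest unchanged."""
    return {unit: entry for unit, entry in index.items()
            if entry["count"] <= first_threshold}
```

With the nine counts from the example above (1, 1, 5, 1819, 2120, 6851, 7654, 9315, 9968) and a first threshold of 1000, only the three entries with counts 1, 1 and 5 survive.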

Of course, in practical applications, if every segmentation unit of a certain normalized character string stored in the database is associated in the database with more normalized character strings than the first threshold, all the index table entries corresponding to the segmentation units of that normalized character string will be deleted from the index list, and the normalized character string will then not appear in the normalized character string set created by the server for the target character string.

In this case, if the normalized character string happens to be the normalized character string corresponding to the target character string input by the user, the problem that the normalized character string corresponding to the target character string cannot be found in the normalized character string set or the normalized character string corresponding to the target character string is found inaccurately may be caused.

In an embodiment shown in the present invention, in order to avoid the above situation, if a segmentation unit of a certain normalized character string stored in a database deletes all index entries in an index list, a server may add an index entry with the minimum number of recorded associated normalized character strings in the index entry corresponding to the segmentation unit to the index list again.

By this way, in the case that the number of the associated normalized character strings in the database of the segmentation unit of a certain normalized character string is greater than the first threshold, it is ensured that the index table entry corresponding to at least one segmentation unit in the segmentation unit of the normalized character string can still be retained in the index list, so that it is ensured that the normalized character string finally appears in the normalized character string set created by the server for the target character string.
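The re-adding rule can be sketched as below. This is a minimal illustration under stated assumptions: segmentation is a sliding window of `n` characters, the index is a dict from unit to a set of names, and the helper names (`segment`, `build_index`, `prune_with_fallback`) are hypothetical.

```python
from collections import defaultdict


def segment(text, n):
    """Sliding-window segmentation into units of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(names, n):
    """Map every segmentation unit to the normalized names that contain it."""
    index = defaultdict(set)
    for name in names:
        for unit in segment(name, n):
            index[unit].add(name)
    return dict(index)


def prune_with_fallback(index, names, n, first_threshold):
    """Prune over-threshold entries, then restore one entry per orphaned name."""
    pruned = {u: s for u, s in index.items() if len(s) <= first_threshold}
    for name in names:
        units = segment(name, n)
        if units and all(u not in pruned for u in units):
            # Every entry of this name was deleted: add back the rarest one.
            rarest = min(units, key=lambda u: len(index[u]))
            pruned[rarest] = index[rarest]
    return pruned


names = ["abcd", "abce", "abcf"]
index = build_index(names, 3)   # "abc" -> 3 names; "bcd", "bce", "bcf" -> 1 each
# A threshold of 0 deletes every entry, which forces the fallback for each name.
pruned = prune_with_fallback(index, names, 3, first_threshold=0)
```

With this fallback, every stored name remains reachable through at least one index entry, at the cost of keeping a few entries above the threshold.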

In this example, when matching the segmentation units in the first segmentation unit set against the segmentation units recorded in the index table entries of the index list, the server may traverse the first segmentation unit set and match its segmentation units one by one. When a segmentation unit in the first segmentation unit set matches the segmentation unit recorded in any index table entry in the index list, the server can acquire that index table entry.

After all the segmentation units in the first segmentation unit set have been matched against the index list, the server can extract the normalized character strings recorded in the acquired index table entries, and then generate the corresponding normalized character string set for the target character string from the extracted normalized character strings.
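The matching and extraction steps above can be sketched as follows. This is a minimal illustration, assuming a sliding-window segmentation and a dict-based index; `segment`, `candidate_set`, and the toy index are hypothetical names and data.

```python
from typing import Dict, List, Set


def segment(text: str, n: int) -> List[str]:
    """Sliding-window segmentation into units of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_set(target: str, index: Dict[str, Set[str]], n: int) -> Set[str]:
    """Union of the normalized strings recorded in every matched index entry."""
    matched_entries = []
    for unit in segment(target, n):      # traverse the first segmentation unit set
        if unit in index:                # the unit matches an index entry
            matched_entries.append(index[unit])
    candidates: Set[str] = set()
    for names in matched_entries:        # extract once matching has finished
        candidates |= names
    return candidates


index = {
    "abc": {"abcdef"},
    "bcd": {"abcdef", "xbcd"},
    "zzz": {"zzzz"},
}
result = candidate_set("abcd", index, 3)
```

Here `result` contains only the names reachable through the units "abc" and "bcd"; the entry for "zzz" is never touched.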

At this point, the normalized character strings in the normalized character string set constitute the finally determined search range within which the normalized character string corresponding to the target character string is searched. The server can subsequently calculate the similarity between the target character string and each character string in the normalized character string set, and then determine the normalized character string corresponding to the target character string based on the similarity.

It should be noted that, when extracting the normalized character strings recorded in the acquired index table entries, the server may by default extract all of them, or may extract only some of them.

In an embodiment shown, the server may preset a second threshold; wherein the second threshold may be lower than the first threshold. For example, when the first threshold is 1000, the second threshold may be set to 100.

When extracting the normalized character strings recorded in the acquired index table entries, the server may identify those acquired index table entries in which the recorded number of associated normalized character strings is lower than the second threshold, and then extract only the normalized character strings recorded in those index table entries.

For example, assume that the database is a yellow pages database, the target character string input by the user is "Alipay", the second threshold is 100, and the index table entries acquired by the server are as shown in the following table:

When extracting the normalized character strings recorded in the acquired index table entries, the server may identify the index table entries in which the recorded number of associated normalized character strings is lower than 100, and then extract only the associated company names recorded in the index table entries of the segmentation units "Alipay", "pay treasure network", and "treasure network", for which the number of associated company names recorded in the table above is lower than 100.

In this way, the server can control the number of character strings in the normalized character string set and thereby narrow the search range for the normalized character string corresponding to the target character string.
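The second-threshold filter can be sketched as below. This is a minimal illustration, assuming the same dict-based index as a stand-in for the index list; `limited_candidate_set` and the toy data are hypothetical.

```python
from typing import Dict, List, Set


def segment(text: str, n: int) -> List[str]:
    """Sliding-window segmentation into units of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def limited_candidate_set(target: str, index: Dict[str, Set[str]],
                          n: int, second_threshold: int) -> Set[str]:
    """Extract names only from matched entries with fewer than
    second_threshold associated names."""
    candidates: Set[str] = set()
    for unit in segment(target, n):
        names = index.get(unit)
        if names is not None and len(names) < second_threshold:
            candidates |= names
    return candidates


index = {
    "abc": {"abcdef"},
    "bcd": {"abcdef", "xbcd", "ybcd"},
}
# The entry for "bcd" has 3 names, which is not lower than 2, so it is skipped.
small = limited_candidate_set("abcd", index, 3, second_threshold=2)
```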

Step 103, calculating the similarity between the target character string and each character string in the normalized character string set.

In this example, after the server generates the corresponding normalized character string set for the target character string based on the extracted normalized character strings, the server may calculate the similarity between the target character string and each character string in the normalized character string set.

In one illustrated embodiment, the server may calculate the similarity between the target string and each string in the normalized string set based on the edit distance. To calculate the edit distance between the target character string and a character string in the normalized character string set, the server can count the minimum number of edit operations needed to convert the target character string into that character string, and then use the counted number as the edit distance between the two.

The editing operation may generally include operations such as adding, deleting, replacing, and transposing.

For example, in implementation, the edit distance may be the standard Levenshtein distance or the Damerau-Levenshtein distance.

The standard Levenshtein distance counts only additions, deletions, and replacements. When the server adopts it, the server counts the number of edits needed to convert the target character string into each character string in the normalized character string set by adding one character, deleting one character, or replacing one character, and uses that number as the edit distance between the target character string and that character string.

The Damerau-Levenshtein distance additionally counts transpositions. When the server adopts it, the server counts the number of edits needed to convert the target character string into each character string in the normalized character string set by adding one character, deleting one character, replacing one character, or transposing two adjacent characters, and uses that number as the edit distance between the target character string and that character string.
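The two distances can be sketched with the usual dynamic-programming recurrences. This is a minimal illustration, not the implementation of the application; the second function is the restricted "optimal string alignment" variant of the Damerau-Levenshtein distance, chosen here for brevity, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting additions, deletions and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # add cb
                           prev[j - 1] + cost))  # replace (or keep)
        prev = cur
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance (optimal string alignment):
    additionally counts a swap of two adjacent characters as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

For the swapped pair "ab" and "ba", `levenshtein` counts two replacements while `osa_distance` counts a single transposition.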

In this example, after the server calculates the edit distance between the target character string and each character string in the normalized character string set, the similarity between the target character string and each character string in the normalized character string set may be calculated based on the calculated edit distance.

The target character string and a character string in the normalized character string set may differ in length, and this difference in length may make the calculated similarity inaccurate.

Therefore, in order to improve the accuracy of the similarity calculation, when calculating the similarity based on the edit distance, the server may first adapt the lengths of the target character string and each character string in the normalized character string set using the edit distance between them, and then calculate the similarity from the adapted lengths. This reduces the influence of the length difference on the similarity calculation result as far as possible and improves the accuracy of the similarity calculation.

In an illustrated embodiment, when adapting the lengths based on the calculated edit distance, the server may determine the maximum value and the minimum value of the lengths of the target character string and the character string in the normalized character string set, and then either subtract the calculated edit distance from the maximum value or add the calculated edit distance to the minimum value. This reduces the length difference between the two character strings and thereby adapts their lengths.

For example, assume that the target character string is ABCD and a character string in the normalized character string set is AFCDEG. The length of the target character string is 4, the length of the normalized character string is 6, and the edit distance between the two is 3 (one replacement and two additions). When adapting the lengths, the server can add the edit distance 3 to the length 4 of the target character string; the adapted length of the target character string is then 7, and its difference from the length 6 of the normalized character string is reduced from 2 to 1. Alternatively, the server can subtract the edit distance 3 from the length 6 of the normalized character string; the adapted length of the normalized character string is then 3, and its difference from the length 4 of the target character string is likewise reduced from 2 to 1.

Of course, in practical applications, the server may also adapt the lengths of the target character string and each character string in the normalized character string set based on the edit distance in ways other than subtracting the edit distance from the maximum length or adding the edit distance to the minimum length; these are not listed one by one in this embodiment.

In this example, after the server performs adaptation on the length of the target character string and the length of each character string in the normalized character string set, the server may calculate the similarity based on the length of each character string in the normalized character string set and the target character string after the adaptation.

In an embodiment shown in the present disclosure, after the lengths of the target string and each string in the normalized string set are adapted, the server may calculate the ratio between the smaller and the larger of the two lengths after adaptation. The ratio is a value between 0 and 1, so the server can use it to characterize the similarity between the target string and each string in the normalized string set.

On the one hand, if the server adapts the lengths by subtracting the edit distance from the maximum value of the lengths of the target character string and the character string in the normalized character string set, then when calculating the similarity the server can calculate the ratio of the maximum value minus the edit distance to the minimum value, and characterize the similarity between the two character strings by this ratio.

On the other hand, if the server adapts the lengths by adding the edit distance to the minimum value of the lengths of the target character string and the character string in the normalized character string set, then when calculating the similarity the server can calculate the ratio of the maximum value to the minimum value plus the edit distance, and characterize the similarity between the two character strings by this ratio.

Based on this, assume that the target string is x, a string in the normalized string set is y, the length of the target string x is | x |, the length of the string y is | y |, and the edit distance between x and y is ds.

If the server adapts | x | and | y | by subtracting ds from the maximum of | x | and | y |, the server can calculate the similarity between the target string x and each string y in the normalized string set by the following formula:

If the server adapts | x | and | y | by adding ds to the minimum of | x | and | y |, the server can calculate the similarity between the target string x and each string y in the normalized string set by the following formula:

In the above two formulas, S represents the similarity between the target string x and each string y in the normalized string set; max (| x |, | y |) represents the maximum of the lengths of the target character string and the character string in the normalized character string set; min (| x |, | y |) represents the minimum of those lengths. C represents a correction parameter introduced into the formula. It may be a constant greater than or equal to 0 (that is, the formula may or may not introduce a C value), and introducing it allows the calculation result of the formula to be corrected.
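The two formulas themselves are not reproduced in this text. Going only by the verbal description (the ratio of the maximum minus ds to the minimum, and the ratio of the maximum to the minimum plus ds), they can be sketched as below. The placement of C in the denominator is an assumption made for illustration and is not confirmed by the text; with C = 0 both expressions reduce to the plain ratios described above.

```latex
S = \frac{\max(|x|,|y|) - \mathit{ds}}{\min(|x|,|y|) + C}
\qquad\text{or}\qquad
S = \frac{\max(|x|,|y|)}{\min(|x|,|y|) + \mathit{ds} + C}
```

Because an edit distance is never smaller than the difference between the two lengths, both numerators are at most their denominators for C ≥ 0, which agrees with the statement that the ratio lies between 0 and 1.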

The specific value of the correction parameter may be an empirical engineering value set by a user according to actual requirements and is not particularly limited in this disclosure. For example, in implementation, the correction parameter may be a smoothing parameter obtained by the user with a smoothing method; introducing it into the formula corrects the calculation result and reduces its error.

In the above formulas, because the lengths of the target character string and each character string in the normalized character string set are adapted according to the edit distance, the influence of the length difference on the similarity calculation result is reduced as far as possible when the similarity is calculated from the adapted lengths, which improves the accuracy of the similarity calculation.

For example, assume that the target character string is "Alipay company" and a normalized character string in the normalized character string set is "Alipay network technology limited company". The edit distance between the two is 6 (six character additions), the length of the target character string is 5, and the length of the normalized character string is 11 (lengths counted in characters of the company names as stored, not of their English renderings). Taking C = 0 as an example, the similarity calculated by the above formulas is S = (11 - 6) / 5 = 1, or S = 11 / (5 + 6) = 1.

Therefore, the similarity calculated by the formula effectively eliminates the influence of the length difference between the target character string and the character string in the normalized character string set on the similarity.
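The arithmetic of the worked examples can be checked with a short sketch. It assumes the ratio forms described in the text, with the correction parameter C added to the denominator (an assumption, since the formulas are not reproduced here); the function names are hypothetical.

```python
def similarity_subtract(len_x: int, len_y: int, ds: int, c: float = 0.0) -> float:
    """(max - ds) / (min + c): the edit distance is subtracted from the longer length."""
    return (max(len_x, len_y) - ds) / (min(len_x, len_y) + c)


def similarity_add(len_x: int, len_y: int, ds: int, c: float = 0.0) -> float:
    """max / (min + ds + c): the edit distance is added to the shorter length."""
    return max(len_x, len_y) / (min(len_x, len_y) + ds + c)


# Company-name example: lengths 5 and 11, edit distance 6, C = 0.
s1 = similarity_subtract(5, 11, 6)
s2 = similarity_add(5, 11, 6)

# ABCD versus AFCDEG example: lengths 4 and 6, edit distance 3, C = 0.
s3 = similarity_subtract(4, 6, 3)
s4 = similarity_add(4, 6, 3)
```

Both forms yield 1 whenever the shorter string can be turned into the longer one by additions alone, because the edit distance then equals the length difference. With C = 0 the first form requires a non-empty shorter string to avoid division by zero.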

Of course, when the similarity between the target character string and the character strings in the normalized character string set is calculated based on the edit distance, calculation methods other than the one shown above may also be used. For example, in one illustrated calculation method, the similarity between the two can be calculated by the following formula:

where, in the above formula, S represents the similarity, ds represents the edit distance, and L represents a character string length.

When the similarity between the target string x and a string y in the normalized string set is calculated by the above formula, the value of L may be any one of min (| x |, | y |), max (| x |, | y |), or | x | + | y |, according to actual requirements.

In step 104, searching for a normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity.

In step 105, the target string is converted into the found normalized string.

In this example, after calculating the similarity between the target character string and each character string in the normalized character string set, the server may compare each calculated similarity with a preset similarity threshold, and search the normalized character string set for the normalized character string corresponding to the target character string by determining whether the similarity reaches the threshold.

If the calculated similarity between the target character string and any character string in the normalized character string set reaches the similarity threshold, the server can determine that this character string is the normalized character string corresponding to the target character string.

For example, still taking the target character string "Alipay company" and the normalized character string "Alipay network technology limited company" in the normalized character string set, and assuming that the similarity threshold is 70%, the similarity between the two calculated by the above formula is 1, which is greater than the similarity threshold. The server can therefore determine "Alipay network technology limited company" to be the normalized character string of the character string "Alipay company".

It should be noted that the similarity threshold may be set by a user according to actual requirements. For example, in implementation, the similarity threshold may be an empirical engineering value: an engineer may manually judge whether a large number of character string pairs are the same, and then set the similarity threshold by analyzing the results of the manual judgment. Alternatively, the results of the manual judgment can be used as data analysis samples, on which the server performs statistical analysis to set the similarity threshold.

Certainly, in practical applications, when a plurality of character strings whose similarity to the target character string reaches a similarity threshold exist in the normalized character string set, the server may determine all the character strings as the normalized character strings corresponding to the target character string, and then output the character strings to the user, so that the user manually confirms the optimal normalized character string.
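The threshold comparison, including the case of several matches, can be sketched as follows. This is a minimal illustration that assumes the similarities have already been computed and are held in a dict; `select_matches` and the sample scores are hypothetical.

```python
from typing import Dict, List, Tuple


def select_matches(similarities: Dict[str, float],
                   threshold: float) -> List[Tuple[str, float]]:
    """Return every candidate whose similarity reaches the threshold,
    best first, so that a user can confirm the final choice."""
    matches = [(name, s) for name, s in similarities.items() if s >= threshold]
    matches.sort(key=lambda item: item[1], reverse=True)
    return matches


scores = {"name a": 1.0, "name b": 0.75, "name c": 0.4}
matches = select_matches(scores, threshold=0.7)
```

Sorting by similarity is a design choice of this sketch; the text only requires that all strings reaching the threshold be output for manual confirmation.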

In this example, after the server finds the normalized character string corresponding to the target character string in the normalized character string set based on the calculated similarity, the target character string may be converted into the found normalized character string for use.

For example, assume that the company name input by the user is the irregular company name "Alipay company". The server normalizes the company name according to the above steps, finds the corresponding normalized company name "Alipay network technology limited company" in the database, and then converts the irregular company name "Alipay company" input by the user into "Alipay network technology limited company" in the background for storage and use.

In the above embodiment, the target character string is traversed and segmented based on a specified segmentation length to obtain a first segmentation unit set, and the normalized character strings associated with the segmentation units in the first segmentation unit set are looked up in a preset index list to obtain the normalized character string set corresponding to the first segmentation unit set. The similarity between the target character string and each character string in the normalized character string set is then calculated, the normalized character string corresponding to the target character string is searched for in the normalized character string set based on the calculated similarity, and the target character string is converted into the found normalized character string.

In this way, the normalized character string set associated with the target character string can be determined quickly through the index list, and the similarity only needs to be calculated against the character strings in that set rather than against every normalized character string stored in the database one by one. The amount of calculation is therefore reduced, and the search efficiency for the normalized character string is improved.
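The overall flow can be sketched end to end as below. This is a simplified illustration under several assumptions: sliding-window segmentation, a dict-based index without threshold pruning, the standard Levenshtein distance, and the "subtract" form of the similarity with C added to the denominator. All names and the sample data are hypothetical.

```python
from collections import defaultdict


def segment(text, n):
    """Sliding-window segmentation into units of n characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(names, n):
    """Map every segmentation unit to the normalized names that contain it."""
    index = defaultdict(set)
    for name in names:
        for unit in segment(name, n):
            index[unit].add(name)
    return dict(index)


def levenshtein(a, b):
    """Edit distance counting additions, deletions and replacements."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # add cb
                           prev[j - 1] + (ca != cb)))  # replace or keep
        prev = cur
    return prev[-1]


def normalize(target, index, n, threshold, c=0.0):
    """Return the best candidate whose similarity reaches the threshold, or None."""
    candidates = set()
    for unit in segment(target, n):
        candidates |= index.get(unit, set())
    best, best_score = None, threshold
    for cand in sorted(candidates):
        ds = levenshtein(target, cand)
        longer = max(len(target), len(cand))
        shorter = min(len(target), len(cand))
        score = (longer - ds) / (shorter + c)
        if score >= best_score:
            best, best_score = cand, score
    return best


names = ["acme network technology ltd", "globex network technology ltd"]
index = build_index(names, 3)
found = normalize("acme ltd", index, 3, threshold=0.9)
```

Only the names sharing at least one unit with the target are scored, so the edit distance is never computed against the whole database.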

Corresponding to the method embodiment, the application also provides an embodiment of the device.

Referring to fig. 2, the present application provides an apparatus 20 for converting a target string into a normalized string, which is applied to a server. Referring to fig. 3, the hardware architecture of the server that carries the apparatus 20 generally includes a CPU, a memory, a non-volatile memory, a network interface, an internal bus, and the like. Taking a software implementation as an example, the apparatus 20 can generally be understood as a computer program loaded in the memory which, after being run by the CPU, forms a logical device combining software and hardware. The apparatus 20 includes:

the first segmentation module 201 is configured to perform traversal segmentation on the target character string based on the specified segmentation length to obtain a first segmentation unit set;

an obtaining module 202, configured to search a preset index list for a normalized character string associated with each segmentation unit in the first segmentation unit set, to obtain a normalized character string set corresponding to the first segmentation unit set; the index list comprises a segmentation unit obtained by traversing and segmenting the normalized character string and the normalized character string associated with the segmentation unit;

a calculating module 203, configured to calculate similarity between the target character string and each character string in the normalized character string set;

a searching module 204, configured to search, based on the calculated similarity, a normalized character string corresponding to the target character string in the normalized character string set;

a conversion module 205, configured to convert the target character string into the found normalized character string.

In this example, the apparatus 20 further comprises:

the second segmentation module 206 is configured to perform traversal segmentation on the normalized character strings stored in the database based on the specified segmentation length to obtain a second segmentation unit set;

a generating module 207, configured to generate corresponding index table entries for each segmentation unit in the second segmentation unit set; the information recorded by the index table entry comprises each segmentation unit of the normalized character string and the normalized character string associated with each segmentation unit in the database.

In this example, the information recorded in the index table entry further includes the number of normalized character strings associated with each splitting unit in the database;

the apparatus 20 further comprises:

an ordering module 208 configured to order the index entries in the index list based on the number;

a deleting module 209, configured to delete from the index list the index table entries whose recorded number is greater than a preset first threshold.

In this example, the apparatus 20 further comprises:

an adding module 210, configured to, when all the index table entries in the index list that correspond to the segmentation units obtained by performing character segmentation on any normalized character string have been deleted, add to the index list the index table entry that records the smallest number among the index table entries corresponding to those segmentation units.

In this example, the obtaining module 202 is specifically configured to:

In this example, the calculating module 203 is specifically configured to:

In this example, the calculation module 203 is further configured to:

the similarity calculation formula includes:

or

In this example, the search module 204 is specifically configured to:

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims

1. A method for converting a target string to a normalized string, the method comprising:

the preset index list is generated as follows: traversing and segmenting the normalized character strings stored in the database based on the specified segmentation length to obtain a second segmentation unit set; generating corresponding index table entries for all the segmentation units in the second segmentation unit set respectively; the information recorded by the index table entry comprises each segmentation unit of the normalized character string, the normalized character string associated with each segmentation unit in the database, and the number of the normalized character strings associated with each segmentation unit in the database; sorting the index table entries in the index list based on the number, deleting the index table entries of which the number is greater than a preset first threshold value, recorded in the index list, and adding the index table entry of which the number is the smallest, recorded in the index table entry corresponding to the splitting unit, to the index list when the corresponding index table entry of the splitting unit of any normalized character string in the index list is deleted;

calculating the editing distance between the target character string and each character string in the normalized character string set based on a preset similarity calculation formula; calculating the maximum value and the minimum value in the lengths of the target character string and each character string; the similarity calculation formula includes:

or

wherein x represents a target character string, | x | represents the length of the target character string; y represents a normalized string, | y | represents the length of the normalized string; max (| x |, | y |) represents the maximum of the lengths of the target string and the normalized string; min (| x |, | y |) represents the minimum of the lengths of the target string and the normalized string; ds represents the edit distance between the target string and the normalized string; C represents a correction parameter which is a constant greater than or equal to 0;

calculating the ratio of the maximum value to the minimum value after subtracting the editing distance, and representing the similarity between the target character string and each character string based on the ratio; or calculating the ratio of the maximum value to the minimum value added with the editing distance, and representing the similarity between the target character string and each character string based on the ratio;

2. The method of claim 1, wherein the searching for the normalized character string associated with each segmentation unit in the first segmentation unit set in a preset index list to obtain the normalized character string set corresponding to the first segmentation unit set comprises:

3. The method of claim 1, wherein the finding a normalized string in the set of normalized strings that corresponds to the target string based on the calculated similarity comprises:

4. An apparatus for converting a target string into a normalized string, the apparatus comprising:

a generating module, configured to generate corresponding index table entries for each segmentation unit in the second segmentation unit set; the information recorded by the index table entry comprises each segmentation unit of the normalized character string, the normalized character string associated with each segmentation unit in the database, and the number of the normalized character strings associated with each segmentation unit in the database;

a deleting module, configured to delete the index table entries recorded in the index list, where the number of the index table entries is greater than a preset first threshold;

the adding module is used for adding the index table entry with the minimum number recorded in the index table entry corresponding to the splitting unit into the index list when the corresponding index table entry in the index list is deleted;

the calculation module is used for calculating the editing distance between the target character string and each character string in the normalized character string set based on a preset similarity calculation formula; calculating the maximum value and the minimum value in the lengths of the target character string and each character string; the similarity calculation formula includes:

or

the calculation module is further used for calculating the ratio of the maximum value to the minimum value after subtracting the editing distance, and representing the similarity between the target character string and each character string based on the ratio; or calculating the ratio of the maximum value to the minimum value added with the editing distance, and representing the similarity between the target character string and each character string based on the ratio;

5. The apparatus of claim 4, wherein the obtaining module is specifically configured to:

6. The apparatus of claim 4, wherein the lookup module is specifically configured to: