JPH04279973A

JPH04279973A - Character string comparison system

Info

Publication number: JPH04279973A
Application number: JP3011449A
Authority: JP
Inventors: Susumu Tanaka; 進田中
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1991-01-08
Filing date: 1991-01-08
Publication date: 1992-10-06

Abstract

PURPOSE: To calculate a degree of similarity so that similar character strings can be distinguished numerically, and to allow various kinds of processing based on the judgment that two character strings are very similar even when they do not match exactly. CONSTITUTION: The comparison system consists of a search means 1, a comparison means 2, and a similarity calculation means 3. The search means 1 retrieves characters one at a time from both character strings A and B according to a predetermined algorithm. The comparison means 2 compares the two characters retrieved by the search means 1, detects whether they match or mismatch, and stores the result. The similarity calculation means 3 calculates the degree of similarity between the two character strings from the comparison results of the comparison means 2 and outputs it. This eliminates the problem that comparing two character strings A and B yields little judgment information, and makes it possible to evaluate numerically how much A and B differ when they are not identical.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to a method for comparing character strings that appear in data in a computer.

[0002]

[Prior Art] Conventionally, methods of this kind have been limited to determining whether or not two character strings match completely.

[0003] FIG. 2 is an explanatory diagram of a conventional character string comparison method. In the illustrated method, character strings A and B are compared by comparison means 21. Character strings A and B are sequences of 1-byte characters stored in a storage device such as a register. The comparison means 21 is composed of a logic circuit that compares character strings A and B, for example, one character at a time. If all corresponding characters of character strings A and B match, the comparison means 21 outputs a match; if even one of the characters does not match, it outputs a mismatch.
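As a point of reference, the conventional all-or-nothing comparison can be sketched in software. This is a minimal illustration only: the source describes a logic circuit, and the function name `exact_match` and the sample strings are assumptions introduced here.

```python
def exact_match(a: str, b: str) -> bool:
    """Conventional comparison: True only if every corresponding character matches."""
    # Strings of different length cannot match in every position.
    if len(a) != len(b):
        return False
    # Compare one character at a time, as comparison means 21 is described as doing.
    return all(x == y for x, y in zip(a, b))


print(exact_match("abcde", "abcde"))  # True
print(exact_match("abcde", "abxde"))  # False
```

A single differing character produces the same output as two entirely unrelated strings, which is the limitation the following paragraphs address.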

[0004]

[Problems to be Solved by the Invention] However, the conventional technique described above has the following problem. Only whether the character strings match is determined; no information is obtained as to whether the two character strings consist of entirely different characters or are very similar and differ only in part. As a result, the comparison result can be used only for very limited processing.

[0005] The present invention has been made in view of the above. Its object is to eliminate the problem that comparing two character strings yields little judgment information, and to provide a method that evaluates numerically how much two character strings differ when they are not identical.

[0006]

[Means for Solving the Problems] The character string comparison method of the present invention is characterized by comprising: a search means that retrieves characters from both character strings; a comparison means that compares the two characters retrieved by the search means, detects whether they match or mismatch, and stores the result; and a similarity calculation means that calculates and outputs the degree of similarity between the two character strings according to the comparison results of the comparison means.

[0007]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram of an embodiment of the character string comparison method of the present invention. The illustrated method comprises a search means 1, a comparison means 2, and a similarity calculation means 3.

[0008] The search means 1 retrieves one character at a time from each of the character strings A and B according to a predetermined algorithm. The details of this algorithm are described later. The comparison means 2 compares the two characters retrieved by the search means 1, detects whether they match or mismatch, and stores the result. The similarity calculation means 3 calculates and outputs the degree of similarity between the two character strings according to the comparison results of the comparison means 2.

[0009] FIGS. 3 and 4 are explanatory diagrams of the character string search algorithm. The characters of the two character strings to be compared are assigned character numbers "1", "2", "3", ... starting from the first character of each. If the comparison by the comparison means 2 finds that the characters with character number "1" match each other, the search and comparison end.

[0010] If they do not match, the character with character number "1" in one character string and the character with character number "2" in the other character string are retrieved and compared. If these characters match, the search and comparison end. If they do not match, the character with character number "2" in the one character string and the character with character number "1" in the other are retrieved and compared. If these characters match, the search and comparison end. If they do not match, the character with character number "2" in the one character string and the character with character number "2" in the other are retrieved and compared. If these characters match, the search ends.

[0011] If they do not match, the character with character number "1" in one character string and the character with character number "3" in the other character string are retrieved and compared. If these characters match, the search and comparison end. If they do not match, the search and comparison continue in the same manner. The order of this search is indicated by the numbers in the figure, and FIG. 4 shows the same order as a table. When the search ends because characters matched, character numbers are reassigned starting from the next characters, and the search according to the algorithm of FIGS. 3 and 4 is repeated.
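The search order described above can be sketched as a generator of character-number pairs. The text states only the first five pairs, (1,1), (1,2), (2,1), (2,2), (1,3), and then says the search continues "in the same manner"; FIGS. 3 and 4 are not reproduced here. The continuation used below (an expanding square: for each new number k, pair it with every smaller number in both directions, then with itself) is therefore an assumption, as is the name `search_order`.

```python
from itertools import islice


def search_order():
    """Yield 1-based (number in one string, number in the other) pairs.

    The first five pairs follow the text; the order after (1, 3) is an
    assumed continuation of the same pattern.
    """
    k = 1
    while True:
        for i in range(1, k):
            yield (i, k)  # smaller number in one string, k in the other
            yield (k, i)  # and the reverse
        yield (k, k)
        k += 1


print(list(islice(search_order(), 5)))
# [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3)]
```

Under this assumed ordering, pairs close to the current starting position are always tried before pairs further away.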

[0012] In FIG. 1, matched characters are shown connected by lines. In the first search and comparison, a match is detected between the first character "a" of character string A and the first character "a" of character string B, so these are shown connected by a line.

[0013] Next, in the second search and comparison, character numbers are reassigned according to FIG. 3 starting from the second character "1" of character string A and the second character "2" of character string B, and the search and comparison are performed. As a result, a match is detected between the second character "1" of character string A and the third character "1" of character string B, so these are shown connected by a line in FIG. 1.

[0014] Next, in the third search and comparison, character numbers are reassigned according to FIG. 3 starting from the third character "2" of character string A and the fourth character "3" of character string B, and the search and comparison are performed. As a result, a match is detected between the fourth character "3" of character string A and the fourth character "3" of character string B, so these are shown connected by a line in FIG. 1. In this case, the third character "2" of character string A and the second character "2" of character string B, indicated by the dotted line in the figure, are not treated as a match. If such cases were also treated as matches, the result could no longer be distinguished from the case in which the first to third characters of character string A and the first to third characters of character string B all match.

[0015] Next, in the fourth search and comparison, character numbers are reassigned according to FIG. 3 starting from the fifth character "4" of character string A and the fifth character "5" of character string B, and the search and comparison are performed. As a result, a match is detected between the sixth character "5" of character string A and the fifth character "5" of character string B, so these are shown connected by a line in FIG. 1. The fifth and subsequent searches and comparisons are performed in the same manner, finally yielding the comparison result shown in the figure. In the example of FIG. 1, 7 characters are regarded as matching between the 12 characters of character string A and the 13 characters of character string B.
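The repeated search-and-renumber procedure walked through above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the search order beyond the pairs named in the text is an assumed expanding-square continuation, character numbers past the end of either string are skipped, and the procedure stops when no further match is found. The names `search_order` and `match_pairs` are hypothetical, and only the leading characters of A and B given in the text are used, since the full strings of FIG. 1 are not reproduced.

```python
def search_order(limit):
    """Return 1-based number pairs in the assumed expanding order, up to `limit`."""
    order = []
    for k in range(1, limit + 1):
        for i in range(1, k):
            order += [(i, k), (k, i)]
        order.append((k, k))
    return order


def match_pairs(a, b):
    """Return 0-based (index in a, index in b) pairs regarded as matching."""
    pairs = []
    start_a = start_b = 0
    while start_a < len(a) and start_b < len(b):
        rem_a, rem_b = len(a) - start_a, len(b) - start_b
        hit = None
        for na, nb in search_order(max(rem_a, rem_b)):
            # Assumption: numbers past the end of either string are skipped.
            if na > rem_a or nb > rem_b:
                continue
            if a[start_a + na - 1] == b[start_b + nb - 1]:
                hit = (start_a + na - 1, start_b + nb - 1)
                break
        if hit is None:
            break  # Assumption: stop when nothing further matches.
        pairs.append(hit)
        # Renumber from the characters following the matched pair.
        start_a, start_b = hit[0] + 1, hit[1] + 1
    return pairs


# Leading characters of A and B as given in the text; the remainder is unknown.
print(match_pairs("a12345", "a2135"))
# [(0, 0), (1, 2), (3, 3), (5, 4)]
```

In 1-based terms the output corresponds to the pairs A1-B1, A2-B3, A4-B4, and A6-B5 described for the first four searches. Because each new search starts after the previously matched characters, the "2" in A (third character) and the "2" in B (second character) are never paired, as the text requires.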

[0016] Next, examples of calculating the degree of similarity from this comparison result are described. A typical example expresses the similarity as the ratio of the number of matched characters to the number of characters in the longer character string. With this calculation, the similarity for the example of FIG. 1 is (7/13) × 100 = 54%. Another example takes the view that, even for the same number of matched characters, the similarity differs subtly depending on whether the matches are concentrated or dispersed, and includes this degree in the similarity. With this calculation, the similarity for the example of FIG. 1 is (7/15) × 100 = 60%. Whichever calculation is adopted, the similarity extends the concept of match (similarity 100%) or mismatch (similarity 0%). When this similarity is used to decide whether character strings are similar or dissimilar, the borderline can be set freely.
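The first calculation above (matched characters divided by the length of the longer string) can be written out directly. The function name `similarity_percent` and the handling of two empty strings are assumptions made for this sketch; the second, dispersion-aware calculation is not specified in enough detail in the text to reproduce and is left out.

```python
def similarity_percent(matched, len_a, len_b):
    """Matched characters as a percentage of the longer string's length."""
    longer = max(len_a, len_b)
    if longer == 0:
        return 100.0  # Assumption: two empty strings are treated as identical.
    return 100.0 * matched / longer


# FIG. 1 example from the text: 7 matches, A has 12 characters, B has 13.
print(round(similarity_percent(7, 12, 13)))  # 54
```

A freely chosen borderline then turns the number into a similar/dissimilar decision, for example `similarity_percent(7, 12, 13) >= 50`.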

[0017] FIG. 5 is an explanatory diagram of the difference between the method of the present invention and the conventional method. When character strings A and B each consist of five characters as illustrated, the conventional method outputs "1" only when all five characters match completely and outputs 0 otherwise. In the method of the present invention, by contrast, 0.8 is output when only one of the five characters mismatches and the other four match. When two of the five characters match and the other three do not, 0.4 is output. When all five characters mismatch, 0 is output.
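The contrast described for FIG. 5 can be tabulated from match counts alone. The sketch below takes the number of matched characters as given, so it illustrates only the two output rules, not the search procedure; the function names are hypothetical.

```python
def conventional_output(matched, length):
    """Conventional method: 1 only when every character matches, otherwise 0."""
    return 1 if matched == length else 0


def proposed_output(matched, length):
    """Proposed method: fraction of characters regarded as matching."""
    return matched / length


for matched in (5, 4, 2, 0):
    print(matched, conventional_output(matched, 5), proposed_output(matched, 5))
# 5 1 1.0
# 4 0 0.8
# 2 0 0.4
# 0 0 0.0
```

The conventional rule collapses the last three rows into a single value, whereas the proposed rule keeps them distinct.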

[0018] FIG. 6 shows an example of using the method of the present invention. In this figure, file A is a program instruction meaning that "aaa" is printed out 100 times. File B is likewise a program instruction, meaning that "bbb" is printed out 100 times. Files A and B differ only in the printed content, "aaa" versus "bbb"; the other parts match completely. Conventionally, the second lines of files A and B were treated as a mismatch and handled separately. By using the method of the present invention, however, lines with a high degree of similarity can be effectively associated with each other, for example when comparing two files line by line, allowing more efficient handling.
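A line-by-line pairing of this kind might look like the sketch below. Everything concrete in it is an assumption: the file contents are invented stand-ins for FIG. 6, the 0.5 borderline is arbitrary, the search order beyond the pairs named in the text is an assumed expanding-square continuation, and the names `offsets`, `count_matches`, `similarity`, and `pair_lines` are hypothetical.

```python
def offsets(rem_a, rem_b):
    """Yield in-range 0-based offset pairs in the assumed expanding order."""
    for k in range(max(rem_a, rem_b)):
        for i in range(k):
            if i < rem_a and k < rem_b:
                yield i, k
            if k < rem_a and i < rem_b:
                yield k, i
        if k < rem_a and k < rem_b:
            yield k, k


def count_matches(a, b):
    """Count characters regarded as matching, restarting after each match."""
    count = sa = sb = 0
    while sa < len(a) and sb < len(b):
        hit = next(((i, j) for i, j in offsets(len(a) - sa, len(b) - sb)
                    if a[sa + i] == b[sb + j]), None)
        if hit is None:
            break
        count += 1
        sa, sb = sa + hit[0] + 1, sb + hit[1] + 1
    return count


def similarity(a, b):
    """Matched characters divided by the length of the longer string."""
    longer = max(len(a), len(b))
    return count_matches(a, b) / longer if longer else 1.0


def pair_lines(lines_a, lines_b, threshold=0.5):
    """Pair each line of A with its most similar line of B, if above threshold."""
    pairs = []
    for i, line in enumerate(lines_a):
        best = max(range(len(lines_b)),
                   key=lambda j: similarity(line, lines_b[j]), default=None)
        if best is not None and similarity(line, lines_b[best]) >= threshold:
            pairs.append((i, best))
    return pairs


# Invented stand-ins for files A and B; the actual contents of FIG. 6 are not given.
file_a = ["for i in 1..100", 'print "aaa"']
file_b = ["for i in 1..100", 'print "bbb"']
print(pair_lines(file_a, file_b))  # [(0, 0), (1, 1)]
```

Here the second lines share 8 of 11 characters, so they are paired even though an exact comparison would report a mismatch.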

[0019]

[Effects of the Invention] As described above, the character string comparison method of the present invention judges similar character strings and therefore has the following effect. Even when character strings do not match completely and would conventionally have been judged a mismatch, it is possible to judge that the two character strings are extremely similar, and various kinds of processing can be performed on the basis of this judgment.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of an embodiment of the character string comparison method of the present invention.

FIG. 2 is an explanatory diagram of a conventional character string comparison method.

FIG. 3 is an explanatory diagram of the character string search algorithm.

FIG. 4 is an explanatory diagram of the character string search algorithm.

FIG. 5 is an explanatory diagram of the difference between the method of the present invention and the conventional method.

FIG. 6 is a diagram showing an example of using the method of the present invention.

[Explanation of Symbols]

1 Search means
2 Comparison means
3 Similarity calculation means

Claims

[Claims]

[Claim 1] A character string comparison method characterized by comprising: a search means that retrieves characters from both character strings; a comparison means that compares the two characters retrieved by the search means, detects whether they match or mismatch, and stores the result; and a similarity calculation means that calculates and outputs the degree of similarity between the two character strings according to the comparison results of the comparison means.