JPH07141166A

JPH07141166A - Program analyzing method using cluster analysis

Info

Publication number: JPH07141166A
Application number: JP5290173A
Authority: JP
Inventors: Toshihiko Oda; 利彦小田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1993-11-19
Filing date: 1993-11-19
Publication date: 1995-06-02

Abstract

PURPOSE: To generate clusters that fit the program design concept of information hiding by defining the distance between the data elements to be clustered, which is the key issue when cluster analysis is adopted as a method for statically analyzing the source code of a program. CONSTITUTION: The distance measure between data elements to be clustered is defined as the difference between the total entropy values before and after two entities are clustered. Using the appearance probability of the event that a given instance of an attribute of a given entity appears, the entropy of each instance over the whole entity space is obtained, and the entropy of the attribute itself is calculated as the sum of these entropies. The two target entities are then clustered, and the distance measure is obtained as the resulting decrease in the total entropy of the attributes of the entities.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a program analysis method using cluster analysis, which statically analyzes the source code of a program and extracts information about its structure and design that can be used in software development and maintenance work.

[0002]

[Prior Art] One conventional approach to static program analysis introduces cluster analysis. Among studies that apply cluster analysis to discover program structure, one measures the strength of data exchange between components with a data binding metric, performs cluster analysis on the resulting distances between components, and generates a hierarchical configuration of subsystems. This is reported in [SB89] (R. W. Selby, V. R. Basili, Error Localization During Software Maintenance: Generating Hierarchical System Descriptions from the Source Code Alone, In Proceedings of the Conference on Software Maintenance, Phoenix, October 1989). That work also characterizes the subsystems by the number of errors and the effort the errors required.

[0003] There is also a report on the development of a tool called dbt, which targets Ada programs and performs cluster analysis using the data binding metric. This is reported in [DB90] (A. Delis, V. R. Basili, Data Binding Tool: a Tool for Measurement Based Ada Source Reusability and Design Assessment, Technical Report, Dept. of Computer Science, University of Maryland, May 1990). The output of dbt, a dendrogram (hierarchical cluster tree), is used to isolate highly reusable pieces of code and to understand how to package them.

[0004] Another report proposes definitions of cohesion and coupling extended to Ada modules (collections of routines, data, and type definitions) and finds that the resulting measurements can indicate the difficulty of maintenance activities. This is reported in [BMB93] (L. C. Briand, S. Morasca, V. R. Basili, Measuring and Assessing Maintainability at the End of High-Level Design, CSM93).

[0005] There is also a report of research that formalizes the decomposition of a system in terms of the relationship between the amount of information and system complexity. This is reported in [PW92] (D. Paulson, Yair Wand, An Automated Approach to Information System Decomposition, IEEE Transactions on Software Engineering, SE-18(3), March 1992).

[0006]

[Problems to be Solved by the Invention] Software maintenance is the continuous work of extending the life of a software product. After a software system has been developed, the software must often be changed or improved in order to fix bugs, satisfy new user requirements, and adapt to changes in the computing environment.

[0007] Meanwhile, as hardware advances, software systems are becoming larger and more complex. This makes software maintenance more difficult and more costly. Moreover, with limited human resources, the rising cost of software maintenance has become a factor that holds back the development of new software.

[0008] There has been much discussion of the differences between the development process and the maintenance process, as reported in [LPLS78, Co89, BR91] and elsewhere. For example, the development process captures the user's specification and then moves on to design work. In the maintenance process, by contrast, the user's specification must be checked against the design of the target program to determine where a new function should be added and how that addition will affect other parts of the program. For this reason, a thorough understanding of the original program is essential in the maintenance process, and maintainers spend a great deal of time acquiring it.

[0009] That is, understanding a program starts with the documents on its specification and design. In many cases, however, such documents do not provide enough information, or they no longer match the program because they were not updated each time the program was changed. As a result, maintenance engineers have to read the source code directly to learn how the program behaves and what role each of its components plays. This is laborious work that consumes a great deal of time.

[0010] Accordingly, an object of the present invention is to build a practical and refined computing environment that supports program understanding in the maintenance process. To this end, the invention provides a method for automatically extracting, from the source code, abstract information about a program at the specification and design level, and for presenting that information.

[0011]

[Means for Solving the Problems] In the invention of claim 1, the distance measure between data elements to be clustered is defined as the difference between the total entropy values before and after two entities are clustered. Using the appearance probability of the event that a given instance of an attribute of a given entity appears, the entropy of each instance over the whole entity space is obtained, and the entropy of the attribute itself is calculated as the sum of these entropies. The two target entities are then clustered, and the distance measure is obtained as the decrease in the total entropy of the attributes of the entities.

[0012] In the invention of claim 2, the entities to be clustered are the modules in a program, and the attribute used to obtain the distance is the event that an external reference appears in a module, where external references are four kinds of program element: module calls, external variables, macros, and types including their members. Cluster analysis is performed with the distance definition in which the difference between the total entropy values before and after two entities are clustered serves as the distance measure between the data elements to be clustered.

[0013] In the invention of claim 3, the entities to be clustered are the four kinds of program element (module calls, external variables, macros, and types including their members), and the attribute of each entity is defined as the set of modules in which that program element appears as an external reference. Cluster analysis is performed with the distance definition in which the difference between the total entropy values before and after two entities are clustered serves as the distance measure between the data elements to be clustered.

[0014]

[Operation] In the invention of claim 1, the distance between the data elements to be clustered, which is the key issue in cluster analysis, is defined as the difference between the total entropy values before and after two entities are clustered. This makes it possible to generate clusters that fit the program design concept of information hiding.

[0015] In the invention of claim 2, the information obtained from the cluster analysis is a hierarchical cluster structure of modules produced by information hiding clustering. It represents one global view of the program's structure and helps in understanding the program. Because it reflects the effect of information hiding, it can also suggest how the system divides into subsystems. It further provides material for judging how much a program change made during maintenance will cost and what risk it carries, so that, for example, the range of influence of a change to a module can be ranked and examined.

[0016] In the invention of claim 3, the cluster analysis yields information on the generality of candidate components from the clusters of module calls, and information on the independence of candidate components from the clusters of modules that hide the appearance of types and global variables. This helps to evaluate candidate components for program reuse accurately and easily.

[0017]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings, under the following headings in order:
A. Overview
B. Hierarchical cluster analysis
C. Entities and attributes
D. Distance measurement
E. Clustering procedure
F. Properties and meaning of each cluster
G. Use of the results of cluster analysis

[0018] A. Overview
To obtain abstract information about a program, such as its structure and design concepts, the approach focuses on two program features: the appearance of external references in modules, and naming in the programming language. Here, a "module" is defined as the primary unit of a program: a closed subroutine that can be compiled independently (see [My78]). In C, a module is a function. An "external reference" is a program element that is defined outside a module and appears inside it. In C, this means the following four kinds of program element:
a. module calls
b. external variables accessed in the module
c. macros used in the module
d. types used in the module and their members
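As an illustration of the four kinds of external reference listed above, the following sketch records them per module. The class name `ModuleRefs`, its field names, and the sample function `draw_line` are hypothetical and chosen only for this example; the patent does not prescribe a data layout.

```python
from dataclasses import dataclass, field


@dataclass
class ModuleRefs:
    """External references that appear inside one module (a C function)."""
    name: str
    calls: set = field(default_factory=set)      # a. modules called
    variables: set = field(default_factory=set)  # b. external variables accessed
    macros: set = field(default_factory=set)     # c. macros used
    types: set = field(default_factory=set)      # d. types and their members used

    def all_refs(self):
        """Return every external reference of the module as one set."""
        return self.calls | self.variables | self.macros | self.types


# Hypothetical C function `draw_line` that calls `set_pixel`, reads the
# external variable `screen`, uses the macro `MAX_X`, and touches the type
# `Point` together with its members `x` and `y`.
draw_line = ModuleRefs(
    name="draw_line",
    calls={"set_pixel"},
    variables={"screen"},
    macros={"MAX_X"},
    types={"Point", "Point.x", "Point.y"},
)
print(sorted(draw_line.all_refs()))
```

Keeping the four kinds in separate fields makes it possible to run a separate cluster analysis for each kind, as the later sections do.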

[0019] The relationships between modules are an important kind of complexity in a system as a whole (see [My78]). These relationships, which arise from causal and dependency relations between modules, can be said to be caused by the appearance of external references inside modules. External references also make it hard to understand a module's behavior locally, because the definition of an external reference is located somewhere other than the module in the source files.

[0020] To deal with the difficulty that external references create for program understanding, hierarchical cluster analysis is used to discover the program structure of the whole system.

[0021] B. Hierarchical cluster analysis
This section describes how to discover the program structure of a large program by cluster analysis. Here, "program structure" means the hierarchical arrangement of clusters obtained by performing cluster analysis, based on a given definition of distance, on the program elements present in the program, namely the four kinds: module calls, external variables, macros, and types.

[0022] Two techniques are considered for the cluster analysis. The first is based on the concept of "information hiding": clusters of modules are formed so that the appearances of external references in the modules are localized inside clusters as far as possible. The hierarchical arrangement of clusters, each of which hides the appearance of external references inside itself, reveals the program structure that follows from information hiding. As is well known, information hiding is one of the important properties in program design; design methods such as composite design and object-oriented design introduce an effective technique called encapsulation to hide the details of how data and operations are implemented. The results of such a cluster analysis can therefore be expected to indicate the design characteristics of a program and to help assess their quality.

[0023] The second technique is based on the simultaneity (co-occurrence) of external references within each module. When two external references appear in the same module, it can be assumed that some interaction between them is possible. For example, if two external variables A and B appear together in many modules, it is reasonable to think that a relationship such as data binding or data flow exists between A and B.
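The idea of simultaneity can be illustrated by counting, for each pair of external variables, the modules in which both appear. This raw count is only an illustration of the intuition; the distance actually used by the method is the entropy-based measure defined in section D. The module names, variable names, and the function `cooccurrence` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical data: module name -> external variables it accesses.
module_vars = {
    "f1": {"A", "B"},
    "f2": {"A", "B", "C"},
    "f3": {"A", "B"},
    "f4": {"C"},
}


def cooccurrence(refs_by_module):
    """Count, for each pair of references, the modules in which both appear."""
    counts = Counter()
    for refs in refs_by_module.values():
        for pair in combinations(sorted(refs), 2):
            counts[pair] += 1
    return counts


pairs = cooccurrence(module_vars)
# A and B appear together far more often than either does with C.
print(pairs[("A", "B")], pairs[("A", "C")], pairs[("B", "C")])
```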

[0024] C. Entities and attributes
An "entity" is a unit of data that becomes an element of a cluster. An "attribute" is a property of an entity that is used to measure the distance between entities, which is needed to perform cluster analysis.

[0025]

[Table 1]

[0026] The four entity/attribute pairs shown in Table 1 correspond to clustering of modules (the entities) by the information hiding technique, according to how external references (the four kinds of attribute described above) appear in the modules. For example, in type CL1-1, the more calls to the same modules a group of modules shares, the stronger the cluster they form; in type CL1-2, the more external variables a group of modules shares, the stronger the cluster.

[0027] Next, swapping the entity and the attribute in each pair of Table 1 defines the four pairs shown in Table 2.
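Swapping entity and attribute amounts to inverting a mapping. The sketch below assumes that the data for one pair (modules as entities, external variables as the attribute, as in type CL1-2) is held as a dictionary of sets; the function name `swap_entity_and_attribute` and the sample data are hypothetical.

```python
from collections import defaultdict


def swap_entity_and_attribute(attrs_by_entity):
    """Turn {entity: {attribute instances}} into {instance: {entities}}."""
    swapped = defaultdict(set)
    for entity, instances in attrs_by_entity.items():
        for instance in instances:
            swapped[instance].add(entity)
    return dict(swapped)


# Hypothetical input in the style of type CL1-2:
# modules (entities) with the external variables they access (attribute).
vars_by_module = {"f1": {"x", "y"}, "f2": {"y"}, "f3": {"x", "y"}}

# Output in the style of type CL2-2:
# external variables (entities) with the modules in which they appear.
modules_by_var = swap_entity_and_attribute(vars_by_module)
print(modules_by_var)
```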

[0028]

[Table 2]

[0029] The four pairs shown in Table 2 correspond to clustering based on the simultaneity of appearance of external references. For example, in type CL2-1, the more often two modules are called from the same module (simultaneity of appearance), the smaller the distance between them. In other words, the modules in a cluster share the event of appearing together in the same modules. In type CL2-2, the more frequently two external variables are shared by the same module, the smaller the distance between them.

[0030] D. Distance measurement
To perform cluster analysis, the distance between every pair of entities must be quantified, and the definition of this distance is the essence of a cluster analysis technique. In clustering based on information hiding, entities are intended to be grouped into clusters so that the spread of the appearance distribution of attribute instances over the entity space is minimized. The degree of spread of an attribute instance is calculated as an entropy over the entity space. The appearance probability of the event that a given instance of an attribute appears in a given entity is therefore defined as follows. In the following equation, e_i denotes the i-th entity, ia_j denotes the j-th instance of the attribute, NE denotes the total number of entities, NIA denotes the total number of attribute instances, and NumOfOccr(ia_i, e_j) denotes the number of occurrences of instance ia_i in entity e_j.

[0031]

[Equation 1]

[0032] The entropy of instance ia_i over the whole entity space is given by the following equation.

[0033]

[Equation 2]

[0034] As a result, the total entropy of the attribute can be calculated by the following equation.

[0035]

[Equation 3]
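The entropy of one instance and the total entropy of an attribute can be sketched as follows. The sketch assumes that the appearance probability of an instance in an entity is its occurrence count in that entity divided by its occurrence count over all entities, an assumption that agrees with the fractions used in the worked example of paragraph [0039]. The function names, the use of the natural logarithm, and the assignment of counts to entities are choices made for this illustration.

```python
import math


def instance_entropy(occurrences):
    """Entropy of one attribute instance over the entity space.

    `occurrences` maps each entity to the number of times the instance
    appears in it; entities with zero occurrences contribute nothing.
    """
    total = sum(occurrences.values())
    entropy = 0.0
    for count in occurrences.values():
        if count > 0:
            p = count / total  # assumed appearance probability
            entropy -= p * math.log(p)
    return entropy


def attribute_entropy(table):
    """Total entropy of an attribute: sum of the entropies of its instances."""
    return sum(instance_entropy(occ) for occ in table.values())


# Hypothetical counts: instance -> {entity: number of occurrences}.
table = {
    "T1": {"A": 4, "B": 3, "C": 1},
    "T2": {"A": 2, "B": 3, "D": 2},
}
print(attribute_entropy(table))
```

An instance that appears in only one entity has entropy zero, which is the fully "hidden" case that information hiding clustering tries to approach.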

[0036] Finally, the distance measure D between two entities e_i and e_j is obtained as the decrease in the total entropy of the attribute distribution caused by clustering (merging) them. That is, D is defined as the difference between the total entropy values before and after e_i and e_j are clustered, calculated by the following equation.

[0037]

[Equation 4]
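Equations 1 to 4 appear only as placeholders in this text. One reading that is consistent with the definitions in paragraphs [0030] to [0036] and with the arithmetic of the worked example in paragraph [0039] is sketched below. The normalization in the first line and the notation H' for the total entropy after merging are inferences, not quotations from the patent.

```latex
\begin{align}
% (1) assumed appearance probability of instance ia_j in entity e_i
P(ia_j, e_i) &= \frac{\mathrm{NumOfOccr}(ia_j, e_i)}
                     {\sum_{k=1}^{NE} \mathrm{NumOfOccr}(ia_j, e_k)} \\
% (2) entropy of instance ia_j over the whole entity space
H(ia_j) &= -\sum_{i=1}^{NE} P(ia_j, e_i)\,\log P(ia_j, e_i) \\
% (3) total entropy of the attribute
H &= \sum_{j=1}^{NIA} H(ia_j) \\
% (4) distance: total entropy before merging e_i and e_j minus total entropy after
D(e_i, e_j) &= H - H'
\end{align}
```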

[0038] The distance measure D defined in this way applies equally to the clustering technique based on the simultaneity of appearance of external references.

[0039] An example of the distance calculation is described with reference to FIG. 1. Given the model with A, B, C, and D shown in FIG. 1(a), A and B are clustered as shown in FIG. 1(b). The types are assumed to be:

  struct T1 { int a, b, c; };
  struct T2 { int p, q, r; };
  struct T3 { int x, y, z; };

Under these conditions, the distance D(A, B) between A and B is calculated as:

  D(A,B) = H([A,B,C,D]) - H'([A∪B,C,D])
         = H(T1) + H(T2) - (H'(T1) + H'(T2))
         = -4/8*log(4/8) - 3/8*log(3/8) - 1/8*log(1/8)    /* H(T1) */
           -2/7*log(2/7) - 3/7*log(3/7) - 2/7*log(2/7)    /* H(T2) */
           -(-4/5*log(4/5) - 1/5*log(1/5))                /* H'(T1) */
           -(-4/6*log(4/6) - 2/6*log(2/6))                /* H'(T2) */
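The calculation above can be reproduced numerically. FIG. 1 is not available in this text, so the reference sets below are hypothetical: they are chosen only so that T1 has the counts 4, 3, 1 and T2 the counts 2, 3, 2 before merging, and 4, 1 and 4, 2 afterwards. The sketch further assumes that the occurrence count of a type in an entity is the number of distinct symbols (the type itself and its members) referenced there, so that merging two entities takes the union of their reference sets; this assumption is what yields the denominators 5 and 6 after merging. The natural logarithm is used.

```python
import math


def total_entropy(refs):
    """Sum over types of the entropy of the type's occurrence distribution.

    `refs` maps entity -> {type name -> set of distinct symbols referenced}.
    The occurrence count of a type in an entity is taken to be the size of
    that set (an assumption made for this sketch).
    """
    types = {t for per_type in refs.values() for t in per_type}
    entropy = 0.0
    for t in sorted(types):
        counts = [len(per_type[t]) for per_type in refs.values() if per_type.get(t)]
        total = sum(counts)
        entropy -= sum(c / total * math.log(c / total) for c in counts)
    return entropy


def merge(refs, a, b):
    """Return a copy of `refs` in which entities a and b form one cluster."""
    merged = {e: r for e, r in refs.items() if e not in (a, b)}
    union = {}
    for e in (a, b):
        for t, symbols in refs[e].items():
            union[t] = union.get(t, set()) | symbols
    merged[a + "+" + b] = union
    return merged


def distance(refs, a, b):
    """Entropy decrease caused by clustering a and b."""
    return total_entropy(refs) - total_entropy(merge(refs, a, b))


# Hypothetical reference sets (see the lead-in for how they were chosen).
refs = {
    "A": {"T1": {"T1", "a", "b", "c"}, "T2": {"T2", "p"}},
    "B": {"T1": {"T1", "a", "b"}, "T2": {"T2", "q", "r"}},
    "C": {"T1": {"T1"}, "T3": {"T3", "x"}},
    "D": {"T2": {"T2", "p"}, "T3": {"T3", "y", "z"}},
}
print(distance(refs, "A", "B"))
```

T3 is referenced only by C and D in this data, so its entropy is the same before and after the merge and cancels out, which matches the absence of a T3 term in the calculation above.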

[0040] E. Clustering procedure
Next, the clustering procedure that generates the hierarchical cluster tree is described. This procedure does not use the similarity matrix that is commonly employed; it operates on a list-type data structure instead. It consists of the following four steps.

[0041] Step 1: Initial cluster set
All entities to be clustered are collected to form the initial cluster set. That is, at the start, every cluster in the initial cluster set corresponds to a single entity. It is also possible, however, to place specific clusters in the initial cluster set; for example, the procedure can start from an initial cluster set in which external libraries and application modules are distinguished as separate clusters. The cluster set (CS) holds the intermediate result of clustering while the process iterates. A cluster set is represented as a binary tree list; an example of an initial cluster set is:
Initial CS: (a b c d e f g h i j k l m n)

[0042] Step 2: Select candidate cluster pairs based on the distance measure
Every pair of clusters (initially, entities) is taken and its distance is obtained according to the definition of the distance measure given above. The pairs of clusters with the shortest distance are then selected. These form the new cluster set (NCS), for example:
NCS: ((a b) (c d) (a c) (m n))
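Step 2 can be sketched as an exhaustive scan over all pairs. Because the NCS example contains several pairs, this sketch interprets "the pairs with the shortest distance" as every pair tied at the minimum; that interpretation, the function name `closest_pairs`, and the toy distance are assumptions. In the method itself the distance would be the entropy-based measure D.

```python
from itertools import combinations


def closest_pairs(clusters, distance, tol=1e-12):
    """Return every pair of clusters whose distance equals the minimum."""
    scored = [(distance(a, b), (a, b)) for a, b in combinations(clusters, 2)]
    if not scored:
        return []  # fewer than two clusters: nothing to pair (NCS is NIL)
    best = min(d for d, _ in scored)
    return [pair for d, pair in scored if d - best <= tol]


# Hypothetical one-dimensional positions standing in for the entropy-based
# distance, used only to make the sketch runnable.
position = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 6.0, "e": 9.5}


def toy_distance(x, y):
    return abs(position[x] - position[y])


ncs = closest_pairs(list(position), toy_distance)
print(ncs)
```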

[0043] Step 3: Termination condition
If the new cluster set NCS is NIL or is identical to the cluster set CS, the procedure terminates and outputs the HCS described below. Otherwise, it proceeds to Step 4.

[0044] Step 4: Update the cluster set
In this step, the cluster set CS is updated with the new cluster set NCS, and the procedure returns to Step 2. The step is carried out as follows.

[0045] Step 4-1: Merge transitive pairs in NCS
The cluster pairs in the new cluster set NCS are assumed to be transitive, and clusters that share an element (cluster) are merged, for example:
NCS: ((a b) (c d) (a c) (m n)) → ((a b c d) (m n))
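The transitive merge of Step 4-1 can be sketched as repeated absorption of overlapping groups. The function name `merge_transitive` is hypothetical, and the sketch flattens each merged cluster into a plain set, which is a simplification of the binary tree list representation mentioned in Step 1.

```python
def merge_transitive(pairs):
    """Merge clusters that share an element until all groups are disjoint."""
    groups = []
    for pair in pairs:
        current = set(pair)
        remaining = []
        for group in groups:
            if group & current:
                current |= group   # absorb every group that overlaps
            else:
                remaining.append(group)
        remaining.append(current)
        groups = remaining
    return groups


# The NCS example from Step 4-1.
ncs = [("a", "b"), ("c", "d"), ("a", "c"), ("m", "n")]
print([sorted(g) for g in merge_transitive(ncs)])
```

Because the groups are kept disjoint after every pair is processed, a single pass over the pairs is enough.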

[0046] Step 4-2: Remove the clusters in NCS from CS
Clusters that overlap with the new cluster set NCS are removed from the cluster set CS. This gives the list of clusters left out of the clustering (RCS), for example:
RCS: (e f g h i j k l)

[0047] Step 4-3: Merge RCS and NCS
The cluster set CS is updated by merging the list RCS with the new cluster set NCS. The updated cluster set CS is appended to the end of the HCS (Hierarchical Cluster Set). The HCS is output as the final result of the cluster analysis, in a form such as [...].
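Steps 4-2 and 4-3 can be sketched together as one update of the cluster set. The function name `update_cluster_set`, the representation of clusters as tuples of entity names, and the choice to start with an empty HCS are assumptions made for this illustration.

```python
def update_cluster_set(cs, merged_ncs, hcs):
    """Apply steps 4-2 and 4-3 to one iteration of the procedure.

    cs         -- current cluster set, a list of tuples of entity names
    merged_ncs -- new clusters after the transitive merge of step 4-1
    hcs        -- list of successive cluster sets (the hierarchy so far)
    """
    absorbed = {entity for cluster in merged_ncs for entity in cluster}
    # Step 4-2: keep only the clusters untouched by the new clusters (RCS).
    rcs = [cluster for cluster in cs if not absorbed & set(cluster)]
    # Step 4-3: the updated cluster set is RCS plus the new clusters.
    new_cs = rcs + [tuple(cluster) for cluster in merged_ncs]
    hcs.append(new_cs)
    return new_cs


# Initial cluster set of Step 1: one cluster per entity a..n.
cs = [(name,) for name in "abcdefghijklmn"]
hcs = []
cs = update_cluster_set(cs, [("a", "b", "c", "d"), ("m", "n")], hcs)
print(cs)
```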

[0048] When the hierarchical cluster tree (HCS) generated by this procedure is, for example, [...], it has the tree structure shown in FIG. 2.

[0049] F. Properties and meaning of each cluster
This section describes the properties of the clusters obtained from the different types of cluster analysis performed on each of the entity/attribute pairs in Tables 1 and 2, and what those clusters mean from a design point of view. Table 1 yields information hiding clustering, and Table 2 yields simultaneity clustering. A cluster located near the leaves of the hierarchical cluster tree, that is, one formed early in the clustering procedure, is called a "strong cluster"; conversely, a cluster near the root of the tree is called a "weak cluster".

[0050] First, information hiding clustering is described. From the entity/attribute pairs in Table 1, information hiding clustering produces a hierarchical arrangement of module clusters that reflects the effect of minimizing the spread of external references, such as global variables and function calls, that appear inside modules.

[0051] A cluster that hides module calls (and likewise macros) may be, for example, a group of modules that call a particular external library. The more module calls a group of modules has in common, the more likely the modules are to be functionally similar.

[0052] A cluster that hides the appearance of global variables is a group of modules joined by the kind of coupling between modules called external coupling. In a strong cluster the modules share more global variables, so coupling is high and the maintainability and ease of change of the modules themselves are considered poor.

[0053] A cluster that hides type information has the property called information cohesion, which is desirable in design terms; the stronger the cluster, the higher its information cohesion.

[0054] Next, simultaneity clustering is described. From the entity/attribute pairs in Table 2, simultaneity clustering produces hierarchical clusters of external references through the effect of minimizing the spread of the event that external references are shared by the same module.

[0055] In clusters of module calls, modules that share the event of being called statically (in the code) from the same module form a cluster. In other words, the modules in such a cluster often act as a set that is called together from another module. A strong cluster is likely to contain general-purpose modules, because its modules are used frequently. The clusters at the top of the cluster tree are groups of modules that are called from only one module.

[0056] A global-variable cluster consists of global variables that are frequently accessed from the same modules. At the top of the cluster tree, a partition into mutually exclusive groups of global variables appears, if such groups exist. Global variables within a strong cluster are considered to be strongly related to one another, for example through data binding or functional role.

[0057] In a type cluster, as in a global-variable cluster, types that are handled together in the same module are grouped. Therefore, assuming that types correspond to data concepts, the types contained in a strong cluster can be said to correspond to data concepts that are strongly related.

[0058] G. Use of cluster analysis results
The preceding description assumes that the analysis is applied to a program whose development has been completed. This section therefore describes the value of the analysis results in the inspection and maintenance processes that follow the development process.

[0059] First, the hierarchical cluster structure of modules produced by information hiding clustering expresses one global view of the program's structure and is useful for understanding the program. It can also suggest how to divide the system into subsystems. Each of the various kinds of cluster can, moreover, be interpreted in terms of its design meaning. This information is considered especially valuable for the top-down comprehension that takes place early in program understanding.

[0060] Next, the results provide a basis for judging how much a program change made during maintenance will cost and how much risk it carries. For example, when a module is changed, the scope affected by the change can be examined in ranked order, from the cluster to which the module belongs outward to neighboring clusters. A change to a module in a strongly coupled cluster can be estimated to carry higher risk. For global variables and types as well, the scope affected by a change can be determined from their clusters.
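The ranked examination of the affected scope can be illustrated with a short Python sketch. It assumes, purely for illustration, that the cluster tree is held as nested tuples of module names; the function name impact_ranking, the rank numbering, and the sample tree are hypothetical and are not taken from the specification.

```python
def impact_ranking(tree, changed):
    """Rank the other modules by how close their cluster is to `changed`.

    tree: nested tuples whose leaves are module names (strings).
    Returns a list of (module, rank); rank 1 is the innermost shared cluster.
    """
    def leaves(node):
        if isinstance(node, str):
            return [node]
        return [m for child in node for m in leaves(child)]

    def path_to(node, target):
        # Clusters from the root down to the one directly holding target,
        # or None when target is not under this node.
        if isinstance(node, str):
            return [] if node == target else None
        for child in node:
            sub = path_to(child, target)
            if sub is not None:
                return [node] + sub
        return None

    path = path_to(tree, changed)
    if path is None:
        raise ValueError(f"module {changed!r} is not in the cluster tree")

    ranking, seen = [], {changed}
    # Walk outward from the innermost cluster to the root.
    for rank, cluster in enumerate(reversed(path), start=1):
        for module in leaves(cluster):
            if module not in seen:
                seen.add(module)
                ranking.append((module, rank))
    return ranking


# Hypothetical cluster tree of five modules.
tree = ((("a", "b"), "c"), ("d", "e"))
ranking = impact_ranking(tree, "a")
```

Modules that share the innermost cluster with the changed module receive rank 1 and would be inspected first; modules joined only at the root receive the largest rank.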

[0061] Regarding program reuse, extracting from existing components those candidates that are worth using as components is said to be an effective way to obtain a sufficient number of components. It is therefore important to evaluate component candidates accurately and easily. Clusters of module calls provide information about the generality of a component candidate, and clusters of modules that hide the occurrence of types and global variables provide information about its independence.

【００６２】[0062]

【発明の効果】[Effects of the Invention] According to the invention of claim 1, the distance measure between the data elements to be clustered is defined as the difference between the total entropy values before and after two entities are clustered. The entropy of each instance over the entire entity space is obtained from the occurrence probability of the event that a given instance of an attribute of a given entity appears, and the entropy of the attribute itself is calculated from the sum of these entropies. The distance measure is then obtained as the reduction in the total entropy of the attributes of the entities that results from clustering the two target entities. Clusters can therefore be generated that are consistent with the program design concept of information hiding.

[0063] In the invention of claim 2, the entities to be clustered are the modules in a program, and the attributes used to obtain the distance are the events in which external references appear within a module, the external references being four kinds of program element: module calls, external variables, macros, and types including members. Cluster analysis is performed using the distance definition of the invention of claim 1. The information obtained from the analysis is therefore a hierarchical cluster structure of modules produced by information hiding clustering, which expresses one global view of the program's structure and so helps in understanding the program. Because it reflects the effect of information hiding, it can also suggest a division of the system into subsystems. It further provides a basis for judging how much a program change made during maintenance will cost and how much risk it carries, and makes it possible, for example, to rank and examine the scope affected by a change to a module.

[0064] In the invention of claim 3, the entities to be clustered are four kinds of program element: module calls, external variables, macros, and types including members. The set of modules in which a program element appears as an external reference is defined as the attribute of that entity, and cluster analysis is performed using the distance definition of the invention of claim 1. As a result, clusters of module calls provide information about the generality of component candidates, and clusters of modules that hide the occurrence of types and global variables provide information about their independence, so the analysis helps evaluate component candidates accurately and easily when reusing programs.

【図面の簡単な説明】[Brief description of drawings]

【図１】FIG. 1 is a schematic diagram showing an example of distance calculation in one embodiment of the present invention.

【図２】FIG. 2 is an explanatory diagram showing an example of a hierarchical cluster tree structure produced by HCS.

─────────────────────────────────────────────────────

【手続補正書】[Procedure amendment]

【提出日】平成５年１２月２８日[Submission date] December 28, 1993

【手続補正１】[Procedure Amendment 1]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３０[Name of item to be corrected] 0030

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

[0030] D. Distance metric
To perform cluster analysis, the distance between every pair of entities must be quantified. The definition of this distance is the essence of a cluster analysis method. In clustering based on the concept of information hiding, entities are intended to be grouped into clusters so as to minimize the spread of the occurrence distribution of attribute instances over the entity space. The degree of spread of the attribute instances is calculated as entropy over the entity space.

【手続補正２】[Procedure Amendment 2]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３１[Name of item to be corrected] 0031

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

[0031] First, for the event that a given attribute instance appears in a given entity, the occurrence probability is defined as follows. In the following equation, atr_ins_i denotes the i-th attribute instance, e_j denotes the j-th entity, NumEty denotes the total number of entities, NumAtrIns denotes the total number of attribute instances, and NumOfOccur(atr_ins_i, e_j) denotes the number of times attribute instance atr_ins_i appears in entity e_j.

【手続補正３】[Procedure Amendment 3]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３２[Name of item to be corrected] 0032

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００３２】[0032]

【数１】[Equation 1]

【手続補正４】[Procedure Amendment 4]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３３[Name of item to be corrected] 0033

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

[0033] Note that NumOfOccur is capped: for example, even if module M is called many times within module P, the number of occurrences of module M in module P is at most 1. This is because the complexity of the relationships between modules does not depend on the number of occurrences within a single module. For attributes of structure type, however, the members of the structure are counted individually, so the value may be 1 or more.
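The counting rule for NumOfOccur can be sketched in Python as follows. This is only an illustration under assumptions: the kind labels ("call", "global", "macro", "struct"), the function name num_of_occur, and the choice to count each distinct structure member once are hypothetical and are not stated in the specification.

```python
def num_of_occur(kind, uses):
    """Occurrence count of one attribute instance within one entity.

    kind: "call", "global", "macro", or "struct" (hypothetical labels).
    uses: the individual references found in the module; for "struct"
          these are the names of the structure members referenced.
    """
    if kind == "struct":
        # Assumption: each distinct member referenced counts once,
        # so a structure type can contribute more than 1.
        return len(set(uses))
    # Repeated references within the same module are capped at 1.
    return 1 if uses else 0


# Module P calls module M three times: still a single occurrence.
calls = num_of_occur("call", ["M", "M", "M"])
# Module P touches members x and y of a structure type: two occurrences.
members = num_of_occur("struct", ["x", "y", "x"])
```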

【手続補正５】[Procedure Amendment 5]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３４[Name of item to be corrected] 0034

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

[0034] The entropy of an attribute instance over all entities is given by the following equation.

【手続補正６】[Procedure Amendment 6]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３５[Name of item to be corrected] 0035

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００３５】[0035]

【数２】[Equation 2]

【手続補正７】[Procedure Amendment 7]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３６[Name of item to be corrected] 0036

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

[0036] Finally, the distance measure D between two entities e_p and e_q is obtained, as in the following equation, as the reciprocal of the difference between the entropy of the attribute instances before and after entities e_p and e_q are clustered.

【手続補正８】[Procedure Amendment 8]

【補正対象書類名】明細書[Document name to be amended] Specification

【補正対象項目名】００３７[Name of item to be corrected] 0037

【補正方法】変更[Correction method] Change

【補正内容】[Correction content]

【００３７】[0037]

【数３】[Equation 3]
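As a rough illustration of the calculation described in paragraphs [0031] to [0036], the following Python sketch computes a distance as the reciprocal of the entropy reduction obtained by merging two entities. The equations themselves are not reproduced in this text, so several points are assumptions rather than the specification's definitions: the occurrence probability is taken to be the count in an entity divided by the total count of that instance over all entities, the logarithm is base 2, a merged cluster keeps the larger of the two counts for a shared instance, and a non-positive reduction is treated as an infinite distance. All function names and the sample data are hypothetical.

```python
import math


def total_entropy(occ):
    """occ maps entity -> {attribute instance: occurrence count}."""
    totals = {}
    for counts in occ.values():
        for inst, n in counts.items():
            totals[inst] = totals.get(inst, 0) + n
    terms = []
    for counts in occ.values():
        for inst, n in counts.items():
            if n > 0:
                p = n / totals[inst]  # assumed occurrence probability
                terms.append(-p * math.log2(p))
    return math.fsum(terms)


def merge(occ, p, q):
    """Return a copy of occ with entities p and q replaced by one cluster."""
    merged = dict(occ[p])
    for inst, n in occ[q].items():
        # Assumption: an instance shared by p and q is hidden inside the
        # cluster, so the cluster keeps the larger count, not the sum.
        merged[inst] = max(merged.get(inst, 0), n)
    out = {e: c for e, c in occ.items() if e not in (p, q)}
    out[(p, q)] = merged
    return out


def distance(occ, p, q, eps=1e-12):
    """Reciprocal of the entropy reduction caused by clustering p and q."""
    drop = total_entropy(occ) - total_entropy(merge(occ, p, q))
    return math.inf if drop <= eps else 1.0 / drop


# Hypothetical data: modules A to D and the global variables g and h they use.
occ = {
    "A": {"g": 1},
    "B": {"g": 1},
    "C": {"g": 1, "h": 1},
    "D": {"h": 1},
}
d_ab = distance(occ, "A", "B")  # A and B share g, so merging hides it
d_ad = distance(occ, "A", "D")  # A and D share nothing
```

Under these assumptions, entities that share attribute instances end up at a finite distance, while entities with nothing in common gain no entropy reduction and are left infinitely far apart.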

Claims

【特許請求の範囲】[Claims]

【請求項１】1. A program analysis method using cluster analysis, characterized in that: a distance measure between data elements to be clustered is defined as the difference between the total entropy values before and after two entities are clustered; the entropy of each instance over the entire entity space is obtained using the occurrence probability of the event that a given instance of an attribute of a given entity appears; the entropy of the attribute itself is calculated from the sum of these entropies; and the distance measure is then obtained as the reduction in the total entropy of the attributes of the entities that results from clustering the two target entities.

【請求項２】2. A program analysis method using cluster analysis, characterized in that: the entities to be clustered are the modules in a program; the attributes used to obtain the distance are the events in which external references appear within a module, the external references being four kinds of program element, namely module calls, external variables, macros, and types including members; and cluster analysis is performed using a distance definition in which the difference between the total entropy values before and after two entities are clustered serves as the distance measure between the data elements to be clustered.

【請求項３】3. A program analysis method using cluster analysis, characterized in that: the entities to be clustered are four kinds of program element, namely module calls, external variables, macros, and types including members; the set of modules in which a program element appears as an external reference is defined as the attribute of that entity; and cluster analysis is performed using a distance definition in which the difference between the total entropy values before and after two entities are clustered serves as the distance measure between the data elements to be clustered.