WO2021038835A1

WO2021038835A1 - Information processing device, and data flow creating program

Info

Publication number: WO2021038835A1
Application number: PCT/JP2019/034153
Authority: WO
Inventors: 貴之北野
Original assignee: 富士通株式会社
Priority date: 2019-08-30
Filing date: 2019-08-30
Publication date: 2021-03-04

Abstract

A database (15) stores information on groups. A group here is a partial data flow comprising two or more processes together with the data from the input data of the first of those processes through the output data of the last. The last process may have no output data. A search unit (18) retrieves from the database (15) the group most similar to a data flow being created. A display unit (19) then extracts, from the group retrieved by the search unit (18), a process that differs from the data flow being created, and displays that process on a display device.

Description

Information processing device and data flow creation program

The present invention relates to an information processing device and a data flow creation program.

Companies today actively promote the use of data accumulated in the course of business. In such work, data scientists analyze the data with the help of data flows, which describe the sequence of processing applied to the data.

One conventional technique stores operation command information in an operation command information storage means, searches that storage means for the command information to be executed in response to input operation information, and executes the retrieved command.

Another technique speeds up the search for flow information. It generates search information from the flow information stored in a flow information storage unit and, on receiving a search request, searches the search information using the search conditions included in the request and obtains the flow information from the search information that matches those conditions.

Japanese Unexamined Patent Publication No. 11-242600
Japanese Unexamined Patent Publication No. 2010-68279

When creating a data flow, a data scientist refers to data flows created in the past, but finding a data flow that can serve as a reference is difficult.

In one aspect, an object of the present invention is to make it possible to output recommendation elements that are useful for a data flow being created.

In one aspect, an information processing device has a database, an extraction unit, and an output unit. The database accumulates series of data flows whose elements are processes, the data used by the processes, and the data obtained as processing results. The extraction unit extracts from the database a data flow similar to the data flow being created. The output unit extracts, from the data flow extracted by the extraction unit, an element that differs from the data flow being created, and outputs the extracted element.

In one aspect, the present invention makes it possible to output recommendation elements that are useful for a data flow being created.

FIG. 1A is a diagram showing a plurality of data flows used for creating a database.
FIG. 1B is a diagram showing the first group combination.
FIG. 1C is a diagram showing the second group combination.
FIG. 1D is a diagram showing the 68th group combination.
FIG. 1E is a diagram showing the 117th (last) group combination.
FIG. 1F is a diagram showing a data flow being created.
FIG. 1G is a diagram showing a recommendation screen.
FIG. 2A is a diagram showing a plurality of data flows used for creating a database.
FIG. 2B is a diagram showing the first group combination.
FIG. 2C is a diagram showing the second group combination.
FIG. 2D is a diagram showing the 130th group combination.
FIG. 2E is a diagram showing the 298th (last) group combination.
FIG. 2F is a diagram showing a data flow being created.
FIG. 2G is a diagram showing a recommendation screen.
FIG. 3 is a diagram showing the functional configuration of the information processing device according to the embodiment.
FIG. 4 is a diagram showing an example of a data flow storage unit.
FIG. 5 is a diagram showing an example of a group storage unit.
FIG. 6 is a diagram showing an example of a database.
FIG. 7 is a diagram showing an example of a creation flow storage unit.
FIG. 8 is a flowchart showing the flow of processing by the information processing device.
FIG. 9 is a diagram showing the hardware configuration of a computer that executes the data flow creation program according to the embodiment.

Embodiments of the information processing device and the data flow creation program disclosed in the present application are described in detail below with reference to the drawings. The embodiments do not limit the disclosed technology.

First, an example of the recommendation performed by the information processing device according to the embodiment is described with reference to FIGS. 1A to 1G. In FIGS. 1A to 1G, an elliptical icon represents a process and a card icon represents data. The data is a csv (comma-separated values) file. "Python" is a programming language, and "Python" inside an ellipse indicates that the process is implemented as a program written in Python.

As shown in FIGS. 1A to 1E, the information processing device according to the embodiment uses a plurality of data flows to identify the metadata and the frequency of each partial data flow that contains a plurality of processes, and stores the identified metadata and frequency in a database. Here, metadata is information associated with a partial data flow; its details are described later. The frequency is a value indicating how often the partial data flow has been used.

Then, as shown in FIGS. 1F and 1G, the information processing device according to the embodiment uses the metadata and the frequency to search the database for the partial data flow that is most similar to the data flow being created and that has more processes than it, and extracts the elements to recommend.

FIG. 1A is a diagram showing a plurality of data flows used for creating a database. Here, four data flows, data flow A to data flow D, are used for creating the database. In data flow A, the information processing device according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". Other statistical differences include "increase in the number of rows", "decrease in the number of values", "decrease in the range of values", "increase in the range of values", "decrease in the types of values", "increase in the types of values", and "calculation of a new column". The device identifies these statistical differences by comparing the input data with the output data.
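The comparison of input and output data can be illustrated with a minimal sketch. The publication does not say how the comparison is implemented, so the function name `statistical_difference`, the in-memory table representation, and the priority order of the checks (rows, then columns, then value count) are all assumptions made for illustration, and only a few of the listed difference types are covered.

```python
from typing import Dict, List, Optional

# A table is a list of rows; None marks a missing value (assumed representation).
Row = Dict[str, Optional[float]]


def count_values(rows: List[Row]) -> int:
    """Count the non-missing cells in a table."""
    return sum(1 for row in rows for value in row.values() if value is not None)


def statistical_difference(inp: List[Row], out: List[Row]) -> str:
    """Return one label describing how `out` differs from `inp`.

    The checks run in an assumed priority order: row count, new columns,
    then the number of non-missing values.
    """
    if len(out) < len(inp):
        return "decrease in the number of rows"
    if len(out) > len(inp):
        return "increase in the number of rows"
    in_cols = set().union(*(row.keys() for row in inp))
    out_cols = set().union(*(row.keys() for row in out))
    if out_cols - in_cols:
        return "calculation of a new column"
    if count_values(out) > count_values(inp):
        return "increase in the number of values"
    if count_values(out) < count_values(inp):
        return "decrease in the number of values"
    return "no difference detected"


# Hypothetical data standing in for Data1.csv to Data4.csv.
data1 = [{"x": 1.0}, {"x": None}, {"x": 3.0}, {"x": 100.0}]
data2 = data1[:3]                                  # one row deleted
data3 = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]       # missing value filled in
data4 = [dict(row, y=row["x"] * 2) for row in data3]  # new column added

print(statistical_difference(data1, data2))
print(statistical_difference(data2, data3))
print(statistical_difference(data3, data4))
```

A real implementation would also need range and value-type checks to distinguish, for example, normalization from other value-preserving steps.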

The information processing device according to the embodiment then identifies "deletion" as the algorithm of the process "Python1" that produces the statistical difference "decrease in the number of rows" between "Data2.csv" and "Data1.csv". A combination of statistical differences and algorithms is an example of metadata. The identified algorithm is displayed below the process. Besides "deletion", "outlier exclusion" is another algorithm that can produce the statistical difference "decrease in the number of rows". Whether the algorithm is "deletion" or "outlier exclusion" is determined by comparing the input data with the output data. The device also identifies "interpolation" as the algorithm of the process "Python2" that produces the statistical difference "increase in the number of values" between "Data3.csv" and "Data2.csv".

Similarly, the information processing device according to the embodiment identifies "normalization" as the algorithm of the process "Python3" that produces the statistical difference between "Data4.csv" and "Data3.csv". Because the algorithm of the process "Python4" that produces the statistical difference between "Data5.csv" and "Data4.csv" cannot be determined, the device sets that algorithm to "unknown". In data flow B, the device identifies "name identification" as a further algorithm.

The information processing device according to the embodiment extracts, as groups, every partial data flow from every data flow. A group comprises two or more processes together with the data from the input data of the first of those processes through the output data of the last. The last process may, however, have no output data. For two groups that belong to different data flows, the device identifies the statistical differences and the algorithms, and determines whether the corresponding statistical differences and the corresponding algorithms match.
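Group extraction amounts to enumerating every run of two or more consecutive processes in a data flow. The sketch below shows one way to do this; the function name `extract_groups` is hypothetical, and the enumeration order (by start position, then by length) is an assumption chosen because it reproduces the group numbers used in this description (A1 is Python1 → Python2, A5 is Python2 → Python3 → Python4).

```python
from typing import List, Tuple


def extract_groups(processes: List[str], min_len: int = 2) -> List[Tuple[str, ...]]:
    """Return every contiguous run of at least `min_len` processes."""
    groups = []
    for start in range(len(processes)):
        for end in range(start + min_len, len(processes) + 1):
            groups.append(tuple(processes[start:end]))
    return groups


# Data flow A has four processes, so it yields 4 * 3 / 2 = 6 groups.
flow_a = ["Python1", "Python2", "Python3", "Python4"]
groups_a = extract_groups(flow_a)
for number, group in enumerate(groups_a, start=1):
    print(f"A{number}: " + " -> ".join(group))
```

The data between the processes is omitted here for brevity; each group would also carry the files from the input of its first process to the output of its last.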

If the corresponding statistical differences and the corresponding algorithms match, the information processing device according to the embodiment determines that the two groups are identical and adds 1 to the frequency of the group. The device then stores the matched statistical differences, the algorithms, the number of processes, and the Python program names of the two groups in the database in association with the frequency. The device determines whether two groups are identical for every combination of groups.
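The identity check and the frequency update can be sketched as follows. This is an illustration under assumptions, not the implementation in the publication: the `Group` class, the `same_group` function, and the use of a `Counter` keyed by the algorithm and difference sequences are all hypothetical, and the third step of group B2 is a placeholder because the description does not state it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Group:
    name: str
    differences: Tuple[str, ...]  # one statistical difference per process
    algorithms: Tuple[str, ...]   # one algorithm per process


def same_group(g1: Group, g2: Group) -> bool:
    """Two groups match when every corresponding difference and algorithm match."""
    return (
        len(g1.algorithms) == len(g2.algorithms)
        and g1.differences == g2.differences
        and g1.algorithms == g2.algorithms
    )


rows_down = "decrease in the number of rows"
values_up = "increase in the number of values"
a1 = Group("A1", (rows_down, values_up), ("deletion", "interpolation"))
b1 = Group("B1", (rows_down, values_up), ("deletion", "interpolation"))
# Placeholder third step: the description only says B2 has a different process count.
b2 = Group("B2", (rows_down, values_up, "placeholder"),
           ("deletion", "interpolation", "placeholder"))

frequency: Counter = Counter()
for x, y in [(a1, b1), (a1, b2)]:
    if same_group(x, y):
        frequency[(x.algorithms, x.differences)] += 1

print(frequency[(a1.algorithms, a1.differences)])
```

Groups whose algorithm is "unknown" need the additional library-name comparison described later, which this sketch leaves out.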

For example, as shown in FIG. 1B, the information processing device according to the embodiment extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow A as group A1. Here, "group A1" is the group whose identifying group number is "A1". The device also extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow B as group B1.

In group A1, the information processing device according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". The device also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".

Similarly, in group B1, the information processing device according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". The device also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".

In group A1 and group B1, the corresponding statistical differences are the same ("decrease in the number of rows" and "increase in the number of values"), and the corresponding algorithms are also the same ("deletion" and "interpolation"). The information processing device according to the embodiment therefore adds 1 to the frequency of the group whose algorithms are "deletion → interpolation" and whose statistical differences are "decrease in the number of rows → increase in the number of values". The device then stores in the database "deletion → interpolation" as the algorithms, "decrease in the number of rows → increase in the number of values" as the statistical differences, "2" as the number of processes, and "1" as the frequency. The device also stores in the database "Python1 → Python2" as the Python programs of group A1 and "Python1 → Python2" as the Python programs of group B1.

Next, as the second combination, the information processing device according to the embodiment extracts group A1 and group B2, as shown in FIG. 1C. Because group A1 and group B2 have different numbers of processes, the device determines that they are different groups and moves on to the next combination. The device repeats the same determination and, as the 68th combination, extracts group A5 and group D2, as shown in FIG. 1D.

In group A5, the information processing device according to the embodiment identifies "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv", "change in the range of values" as the statistical difference between "Data4.csv" and "Data3.csv", and "calculation of a new column" as the statistical difference between "Data5.csv" and "Data4.csv".

The information processing device according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change in the range of values". Because the algorithm that produces the statistical difference "calculation of a new column" cannot be determined, the device sets that algorithm to "unknown" and extracts the names of the libraries imported by "Python4".

Similarly, in group D2, the information processing device according to the embodiment identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", "change in the range of values" as the statistical difference between "Data3.csv" and "Data2.csv", and "calculation of a new column" as the statistical difference between "Data4.csv" and "Data3.csv".

The information processing device according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change in the range of values". Because the algorithm that produces the statistical difference "calculation of a new column" cannot be determined, the device sets that algorithm to "unknown" and extracts the names of the libraries imported by "Python3".

In group A5 and group D2, the corresponding statistical differences are the same ("increase in the number of values", "change in the range of values", and "calculation of a new column"), and the corresponding algorithms "interpolation" and "normalization" are also the same. Because the algorithm that produces the statistical difference "calculation of a new column" is "unknown", the information processing device according to the embodiment determines whether the proportion of matching library names exceeds 0.8. If it does, the device determines that the statistical difference "calculation of a new column" was produced in the same way in both groups, and that group A5 and group D2 match.
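The library-name comparison for "unknown" algorithms might look like the sketch below. The description does not define how the matching proportion is computed, so the ratio used here (shared names divided by all distinct names across both programs) is an assumption, as are the function names `imported_libraries` and `library_match_ratio`. The import extraction relies on the standard `ast` module.

```python
import ast
from typing import Set


def imported_libraries(source: str) -> Set[str]:
    """Collect the top-level package names imported by a Python program."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def library_match_ratio(libs_a: Set[str], libs_b: Set[str]) -> float:
    """Assumed definition: shared names over all distinct names."""
    union = libs_a | libs_b
    if not union:
        return 0.0  # assumption: no imports on either side counts as no evidence
    return len(libs_a & libs_b) / len(union)


# Hypothetical program texts standing in for two "unknown" processes.
program_a = "import numpy as np\nimport pandas\nimport scipy\nimport math\nimport os.path\n"
program_b = program_a + "from json import loads\n"

libs_a = imported_libraries(program_a)
libs_b = imported_libraries(program_b)
ratio = library_match_ratio(libs_a, libs_b)
print(ratio > 0.8)  # five shared names out of six distinct names
```

Under this definition, five shared names out of six clears the 0.8 threshold, while four out of five does not, because the threshold is exclusive.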

The information processing device according to the embodiment therefore adds 1 to the frequency of the group whose algorithms are "interpolation → normalization → library name match ratio over 0.8" and whose statistical differences are "increase in the number of values → change in the range of values → calculation of a new column". The device then stores in the database "interpolation → normalization → library name match ratio over 0.8" as the algorithms, "increase in the number of values → change in the range of values → calculation of a new column" as the statistical differences, and "3" as the number of processes. The device also stores in the database "Python2 → Python3 → Python4" as the Python programs of A5 and "Python1 → Python2 → Python3" as the Python programs of D2, and stores "1" as the frequency.

The information processing device according to the embodiment repeats the same determination and, as the 117th combination (the last combination in the round-robin over the groups), extracts group C3 and group D3, as shown in FIG. 1E, and performs the same processing.
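The count of 117 combinations can be checked with a short sketch. The process counts for data flows B and C are not stated in the text; four and three are assumed here because they are consistent with the group numbers B6 and C3 and with the total of 117. The pairing order (data flow pairs first, then groups within each pair) is likewise an assumption that happens to reproduce the 1st, 2nd, 68th, and 117th combinations named for FIGS. 1B to 1E.

```python
from itertools import combinations, product
from typing import Dict, List

# Number of processes per data flow; the values for B and C are inferred, not stated.
process_counts: Dict[str, int] = {"A": 4, "B": 4, "C": 3, "D": 3}


def group_names(flow: str, n_processes: int) -> List[str]:
    """A flow with n processes has n * (n - 1) / 2 groups of two or more processes."""
    count = n_processes * (n_processes - 1) // 2
    return [f"{flow}{i}" for i in range(1, count + 1)]


groups = {flow: group_names(flow, n) for flow, n in process_counts.items()}

# Pair only groups that belong to different data flows.
pairs = [
    pair
    for flow_1, flow_2 in combinations(groups, 2)
    for pair in product(groups[flow_1], groups[flow_2])
]

print(len(pairs))   # total number of combinations
print(pairs[0])     # 1st combination
print(pairs[67])    # 68th combination
print(pairs[-1])    # last combination
```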

As a result of the above processing, the information processing device according to the embodiment sets the frequency to 1 for each of the following matched pairs: group A1 and group B1, group A4 and group D1, group A6 and group D3, group B6 and group C3, and group A5 and group D2. It sets the frequency of the other groups to 0.

The information processing device according to the embodiment then refers to the database and makes a recommendation for the data flow being created. FIG. 1F is a diagram showing a data flow being created. In data flow U shown in FIG. 1F, the device identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", and "calculation of a new column" as the statistical difference between "Data3.csv" and "Data2.csv". The device also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown", and extracts the names of the libraries imported by "Python2".

Data flow U has "2" processes, its statistical differences are "increase in the number of values" and "calculation of a new column", and its algorithms are "interpolation" and "unknown". The information processing device according to the embodiment therefore identifies, among the groups accumulated in the database, the largest group that satisfies the following conditions.
- It has three or more processes.
- Its frequency is at or above a threshold (for example, a threshold of 1).
- Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
- The proportion of matching names among the libraries imported by the "unknown" Python programs exceeds 0.8.
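The four search conditions can be sketched as a filter over database records followed by picking the largest survivor. This is a sketch under several assumptions: "includes" is read as an in-order subsequence test, the library proportion uses shared names over all distinct names, and the `Record` class, the function names, and the library names are hypothetical. The record contents for A5/D2 and A3 follow the algorithms and differences given in the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Set, Tuple


@dataclass
class Record:
    name: str
    algorithms: Tuple[str, ...]
    differences: Tuple[str, ...]
    frequency: int
    unknown_libraries: Set[str]  # libraries imported by the "unknown" process


def is_subsequence(needle: Sequence[str], haystack: Sequence[str]) -> bool:
    """True when the items of `needle` appear in `haystack` in the same order."""
    it = iter(haystack)
    return all(item in it for item in needle)


def match_ratio(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_best(records: List[Record], algorithms: Sequence[str],
              differences: Sequence[str], libraries: Set[str],
              min_frequency: int = 1, min_ratio: float = 0.8) -> Optional[Record]:
    """Return the largest record that satisfies all four conditions, if any."""
    candidates = [
        r for r in records
        if len(r.algorithms) > len(algorithms)          # more processes than the query
        and r.frequency >= min_frequency
        and is_subsequence(algorithms, r.algorithms)
        and is_subsequence(differences, r.differences)
        and match_ratio(libraries, r.unknown_libraries) > min_ratio
    ]
    return max(candidates, key=lambda r: len(r.algorithms), default=None)


rows_down, values_up = "decrease in the number of rows", "increase in the number of values"
range_change, new_col = "change in the range of values", "calculation of a new column"
libs = {"numpy", "pandas"}  # hypothetical library names

records = [
    Record("A1/B1", ("deletion", "interpolation"), (rows_down, values_up), 1, set()),
    Record("A5/D2", ("interpolation", "normalization", "unknown"),
           (values_up, range_change, new_col), 1, libs),
    Record("A3", ("deletion", "interpolation", "normalization", "unknown"),
           (rows_down, values_up, range_change, new_col), 0, libs),
]

best = find_best(records, ("interpolation", "unknown"), (values_up, new_col), libs)
print(best.name if best else None)
```

The four-process record A3 contains the query but is excluded because its frequency is 0, so the three-process record wins.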

The information processing device according to the embodiment identifies group D2 as the largest group that satisfies the above conditions, identifies "Python2" as the program that implements the process of D2 that is absent from the data flow being created, and recommends it. FIG. 1G is a diagram showing a recommendation screen. As shown in FIG. 1G, based on D2, the device recommends inserting "normalization" between "interpolation" and "unknown" in the data flow being created. The device displays the recommended process together with its input and output data, marked, for example, with a green frame.
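Finding the process to recommend comes down to walking the matched group and the data flow being created side by side and reporting the steps that only the matched group has. The sketch below assumes the current flow is an in-order subsequence of the matched group; the function name `missing_steps` is hypothetical.

```python
from typing import List, Sequence, Tuple


def missing_steps(current: Sequence[str], reference: Sequence[str],
                  programs: Sequence[str]) -> List[Tuple[int, str, str]]:
    """Return (insert position in `current`, algorithm, program) for each
    reference step that the current flow lacks."""
    result = []
    i = 0  # index of the next unmatched step in the current flow
    for step, program in zip(reference, programs):
        if i < len(current) and current[i] == step:
            i += 1
        else:
            result.append((i, step, program))
    return result


flow_u = ["interpolation", "unknown"]                      # data flow being created
group_d2 = ["interpolation", "normalization", "unknown"]   # matched group
programs_d2 = ["Python1", "Python2", "Python3"]

recommendations = missing_steps(flow_u, group_d2, programs_d2)
print(recommendations)
```

Position 1 means the recommended step goes between the first and second processes of data flow U, which corresponds to inserting "normalization" between "interpolation" and "unknown".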

In this way, the information processing device according to the embodiment searches the database for a data flow that can serve as a reference and displays it, and can thereby support a data scientist in creating a data flow.

Next, another recommendation example is described with reference to FIGS. 2A to 2G. FIG. 2A is a diagram showing a plurality of data flows used for creating a database. In this example, four data flows, namely data flow AA, data flow B, data flow C, and data flow DD, are used for creating the database.

As shown in FIG. 2B, the information processing device according to the embodiment extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow AA as group AA1. The device also extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow B as group B1.

In group AA1, the information processing device according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". The device also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".

In group AA1 and group B1, the corresponding statistical differences are the same ("decrease in the number of rows" and "increase in the number of values"), and the corresponding algorithms are also the same ("deletion" and "interpolation"). The information processing device according to the embodiment therefore adds 1 to the frequency of the group whose algorithms are "deletion → interpolation" and whose statistical differences are "decrease in the number of rows → increase in the number of values". The device then stores in the database "deletion → interpolation" as the algorithms, "decrease in the number of rows → increase in the number of values" as the statistical differences, "2" as the number of processes, and "1" as the frequency. The device also stores in the database "Python1 → Python2" as the Python programs of group AA1 and "Python1 → Python2" as the Python programs of group B1.

Next, as the second combination, the information processing device according to the embodiment extracts group AA1 and group B2, as shown in FIG. 2C. Because group AA1 and group B2 have different numbers of processes, the device determines that they are different groups and moves on to the next combination. The device repeats the same determination and, as the 130th combination, extracts group AA7 and group DD7, as shown in FIG. 2D.

　そして、実施例に係る情報処理装置は、グループＡＡ７において、「Ｄａｔａ３．ｃｓｖ」と「Ｄａｔａ２．ｃｓｖ」の統計的な差異として「値の数の増加」を特定する。また、実施例に係る情報処理装置は、グループＡＡ７において、「Ｄａｔａ４．ｃｓｖ」と「Ｄａｔａ３．ｃｓｖ」の統計的な差異として「値の範囲の変更」を特定する。また、実施例に係る情報処理装置は、グループＡＡ７において、「Ｄａｔａ５．ｃｓｖ」と「Ｄａｔａ４．ｃｓｖ」の統計的な差異として「新しい列の算出」を特定する。また、実施例に係る情報処理装置は、グループＡＡ７において、統計的な差異として「出力ファイルなし」を特定する。 Then, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "change of value range" as a statistical difference between "Data4.csv" and "Data3.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column" as a statistical difference between "Data5.csv" and "Data4.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "no output file" as a statistical difference in the group AA7.

The information processing apparatus according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change of value range". It treats the algorithm that produces the statistical difference "calculation of a new column" as "unknown" and extracts the names of the libraries imported by "Python4". Finally, it identifies "graph display" as the algorithm that produces "no output file".

　同様に、実施例に係る情報処理装置は、グループＤＤ７において、「Ｄａｔａ３．ｃｓｖ」と「Ｄａｔａ２．ｃｓｖ」の統計的な差異として「値の数の増加」を特定する。また、実施例に係る情報処理装置は、グループＤＤ７において、「Ｄａｔａ４．ｃｓｖ」と「Ｄａｔａ３．ｃｓｖ」の統計的な差異として「値の範囲の変更」を特定する。また、実施例に係る情報処理装置は、グループＤＤ７において、「Ｄａｔａ５．ｃｓｖ」と「Ｄａｔａ４．ｃｓｖ」の統計的な差異として「新しい列の算出」を特定する。また、実施例に係る情報処理装置は、グループＤＤ７において、統計的な差異として「出力ファイルなし」を特定する。 Similarly, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group DD7. Further, the information processing apparatus according to the embodiment specifies "change of value range" as a statistical difference between "Data4.csv" and "Data3.csv" in the group DD7. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column" as a statistical difference between "Data5.csv" and "Data4.csv" in the group DD7. Further, the information processing apparatus according to the embodiment specifies "no output file" as a statistical difference in the group DD7.

In group AA7 and group DD7, the corresponding statistical differences ("increase in the number of values", "change of value range", "calculation of a new column", and "no output file") are the same, and the corresponding algorithms "interpolation", "normalization", and "graph display" are the same. Because the algorithm that produces the statistical difference "calculation of a new column" is unknown, the information processing apparatus according to the embodiment determines whether the ratio of matching library names exceeds 0.8. If it does, the apparatus determines that the statistical difference "calculation of a new column" was produced in the same way, and therefore that group AA7 and group DD7 match.
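The library-name comparison can be sketched as follows. This is only an illustration: the description does not say how the match ratio is computed, so the sketch assumes it is the number of shared library names divided by the number of distinct library names used by either program. The function names `imported_libraries` and `library_match_ratio` and the two sample programs are hypothetical.

```python
import ast


def imported_libraries(source: str) -> set:
    """Return the top-level library names imported by a Python program."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped; they are not libraries.
            names.add(node.module.split(".")[0])
    return names


def library_match_ratio(source_a: str, source_b: str) -> float:
    """Assumed definition: shared library names over all library names."""
    libs_a = imported_libraries(source_a)
    libs_b = imported_libraries(source_b)
    if not libs_a and not libs_b:
        return 0.0  # assumption: no imports gives no evidence of a match
    return len(libs_a & libs_b) / len(libs_a | libs_b)


THRESHOLD = 0.8  # example threshold from the description

# Hypothetical stand-ins for the "unknown" programs of two groups.
prog_aa = "import numpy\nimport pandas as pd\nfrom scipy import stats\n"
prog_dd = "import pandas\nimport numpy as np\nimport scipy.signal\n"

ratio = library_match_ratio(prog_aa, prog_dd)
same_unknown_algorithm = ratio > THRESHOLD
```

Parsing with `ast` rather than matching text keeps aliases (`import numpy as np`) and submodule imports (`import scipy.signal`) from being counted as different libraries.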

The information processing apparatus according to the embodiment therefore adds 1 to the frequency of the group whose algorithm is "interpolation → normalization → library name match ratio above 0.8 → graph display" and whose statistical difference is "increase in the number of values → change of value range → calculation of a new column → no output file". It then stores that algorithm and that statistical difference in the database. It also stores "4" as the number of processes, "Python2 → Python3 → Python4 → Python5" as the Python program of AA7, "Python2 → Python3 → Python4 → Python5" as the Python program of DD7, and "1" as the frequency.

　そして、実施例に係る情報処理装置は、同様の判定を繰り返していき、２９８個目の組み合わせ（グループの総当たりの最後の組み合わせ）として、図２Ｅに示すように、グループＣ３とグループＤＤ１０を抽出し、同様の処理を行う。 Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group C3 and group DD10 as the 298th combination (the last combination of group round robin) as shown in FIG. 2E. Then, perform the same processing.

　以上の処理の結果として、実施例に係る情報処理装置は、グループＡＡ１とグループＢ１とグループＤＤ１の頻出度を３と特定する。また、実施例に係る情報処理装置は、グループＡＡ２とグループＤＤ２、グループＡＡ３とグループＤＤ３、グループＡＡ４とグループＤＤ４、グループＡＡ５とグループＤＤ５、グループＡＡ６とグループＤＤ６の頻出度を１と特定する。また、実施例に係る情報処理装置は、グループＡＡ７とグループＤＤ７、グループＡＡ８とグループＤＤ８、グループＡＡ９とグループＤＤ９、グループＡＡ１０とグループＤＤ１０、グループＢ６とグループＣ３の頻出度を１と特定する。また、実施例に係る情報処理装置は、他のグループの頻出度を０と特定する。 As a result of the above processing, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA1, group B1 and group DD1 is 3. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA2 and group DD2, group AA3 and group DD3, group AA4 and group DD4, group AA5 and group DD5, and group AA6 and group DD6 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA7 and group DD7, group AA8 and group DD8, group AA9 and group DD9, group AA10 and group DD10, and group B6 and group C3 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of other groups is 0.

The information processing apparatus according to the embodiment then refers to the database to make recommendations for the data flow being created. FIG. 2F shows the data flow being created. In data flow U shown in FIG. 2F, the apparatus identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", and "calculation of a new column" as the statistical difference between "Data3.csv" and "Data2.csv". It identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values" and "unknown" as the algorithm that produces the statistical difference "calculation of a new column". It then extracts the names of the libraries imported by "Python2".

The number of processes in data flow U is "2", its statistical differences are "increase in the number of values" and "calculation of a new column", and its algorithms are "interpolation" and "unknown". The information processing apparatus according to the embodiment therefore identifies, among the groups stored in the database, the largest group that satisfies the following conditions.
・It has three or more processes.
・Its frequency is at or above the threshold (for example, a threshold of 1).
・Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
・The match ratio of the names of the libraries imported by the "unknown" Python program exceeds 0.8.

The information processing apparatus according to the embodiment identifies group DD4 as the largest group satisfying the above conditions. It then identifies and recommends "Python1", "Python3", and "Python5" as the programs that implement the processes in DD4 that are missing from the data flow being created. FIG. 2G shows the recommendation screen. As shown in FIG. 2G, based on DD4, the apparatus recommends inserting "deletion" before "interpolation", "normalization" between "interpolation" and "unknown", and "graph display" after "unknown" in the data flow being created.
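The comparison that yields these three recommendations can be sketched as a walk over the retrieved group, recording each step that the data flow being created lacks together with the index at which it would be inserted. The function name `recommend`, the tuple layout, and the assumption that the flow's steps appear in the group in the same order are illustrative and do not come from the description.

```python
from typing import List, Tuple


def recommend(group_steps: List[Tuple[str, str]],
              current_steps: List[str]) -> List[Tuple[int, str, str]]:
    """Return (position, algorithm, program) for each step missing from the flow.

    `position` is the index in `current_steps` before which the step would be
    inserted; a position equal to len(current_steps) means "append at the end".
    """
    recommendations = []
    pos = 0  # index of the next not-yet-matched step in the flow being created
    for algorithm, program in group_steps:
        if pos < len(current_steps) and current_steps[pos] == algorithm:
            pos += 1  # the flow being created already has this step
        else:
            recommendations.append((pos, algorithm, program))
    return recommendations


# Group DD4 as described in the text: algorithm and the program implementing it.
dd4 = [
    ("deletion", "Python1"),
    ("interpolation", "Python2"),
    ("normalization", "Python3"),
    ("unknown", "Python4"),
    ("graph display", "Python5"),
]
# Data flow U: interpolation followed by an unknown algorithm.
flow_u = ["interpolation", "unknown"]

result = recommend(dd4, flow_u)
```

For this input the sketch proposes "deletion" (Python1) at position 0, "normalization" (Python3) at position 1, and "graph display" (Python5) at position 2, which corresponds to the before, between, and after placements in the recommendation screen.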

　このように、実施例に係る情報処理装置は、複数のプロセスをリコメンドすることで、データサイエンティストに複数の選択肢を提供することができる。 In this way, the information processing apparatus according to the embodiment can provide a plurality of options to the data scientist by recommending a plurality of processes.

　次に、実施例に係る情報処理装置の機能構成について説明する。図３は、実施例に係る情報処理装置の機能構成を示す図である。図３に示すように、実施例に係る情報処理装置１０は、データフロー記憶部１１と、グループ抽出部１２と、グループ記憶部１３と、頻出度計算部１４と、データベース１５とを有する。また、実施例に係る情報処理装置１０は、作成フロー記憶部１６と、作成メタ情報記憶部１７と、検索部１８と、表示部１９とを有する。 Next, the functional configuration of the information processing device according to the embodiment will be described. FIG. 3 is a diagram showing a functional configuration of the information processing apparatus according to the embodiment. As shown in FIG. 3, the information processing apparatus 10 according to the embodiment includes a data flow storage unit 11, a group extraction unit 12, a group storage unit 13, a frequency calculation unit 14, and a database 15. Further, the information processing device 10 according to the embodiment includes a creation flow storage unit 16, a creation meta information storage unit 17, a search unit 18, and a display unit 19.

　データフロー記憶部１１は、複数のデータフローのグラフ構造の情報を記憶する。情報処理装置１０は、例えば、ユーザがマウスを用いて行った指示を受け付けてファイルからデータフローのグラフ構造の情報を読み出してデータフロー記憶部１１に格納したり追加したりする。 The data flow storage unit 11 stores information on the graph structure of a plurality of data flows. For example, the information processing device 10 receives an instruction given by the user using the mouse, reads out information on the graph structure of the data flow from the file, and stores or adds it to the data flow storage unit 11.

　図４は、データフロー記憶部１１の一例を示す図である。図４に示すように、データフロー記憶部１１は、データフローを識別するデータフロー名とデータフローのグラフ構造の情報を対応付けて記憶する。データフロー記憶部１１は、例えば、データフローＡについて、「Ｄａｔａ１．ｃｓｖ→Ｐｙｔｈｏｎ１→Ｄａｔａ２．ｃｓｖ」、「Ｄａｔａ２．ｃｓｖ→Ｐｙｔｈｏｎ２→Ｄａｔａ３．ｃｓｖ」を記憶する。また、データフロー記憶部１１は、データフローＡについて、「Ｄａｔａ３．ｃｓｖ→Ｐｙｔｈｏｎ３→Ｄａｔａ４．ｃｓｖ」、「Ｄａｔａ４．ｃｓｖ→Ｐｙｔｈｏｎ４→Ｄａｔａ５．ｃｓｖ」を記憶する。 FIG. 4 is a diagram showing an example of the data flow storage unit 11. As shown in FIG. 4, the data flow storage unit 11 stores the data flow name that identifies the data flow and the information of the graph structure of the data flow in association with each other. The data flow storage unit 11 stores, for example, "Data1.csv-> Python1-> Data2.csv" and "Data2.csv-> Python2-> Data3.csv" for the data flow A. Further, the data flow storage unit 11 stores "Data3.csv-> Python3-> Data4.csv" and "Data4.csv-> Python4-> Data5.csv" for the data flow A.
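As a rough illustration of this storage format, the sketch below parses each stored element into an input, a process, and an output, and then recovers the order of the processes by following the files. The names `Element`, `parse_element`, and `process_chain` are hypothetical, and the sketch assumes a single linear chain with no branches.

```python
from typing import List, NamedTuple, Optional


class Element(NamedTuple):
    input_data: str
    process: str
    output_data: Optional[str]  # None when a process writes no output file


def parse_element(text: str) -> Element:
    """Parse 'input → process → output' (the output part may be absent)."""
    parts = [part.strip() for part in text.replace("->", "→").split("→")]
    if len(parts) == 2:
        return Element(parts[0], parts[1], None)
    if len(parts) == 3:
        return Element(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected element: {text!r}")


def process_chain(elements: List[Element]) -> List[str]:
    """Order processes by following each output file to the element that reads it."""
    by_input = {element.input_data: element for element in elements}
    produced = {element.output_data for element in elements}
    # The chain starts at the element whose input no other element produces.
    current = next(e for e in elements if e.input_data not in produced)
    chain = []
    while current is not None:
        chain.append(current.process)
        current = by_input.get(current.output_data)
    return chain


# Elements of data flow A as listed above, deliberately out of order.
flow_a = [parse_element(text) for text in (
    "Data3.csv -> Python3 -> Data4.csv",
    "Data1.csv -> Python1 -> Data2.csv",
    "Data4.csv -> Python4 -> Data5.csv",
    "Data2.csv -> Python2 -> Data3.csv",
)]
chain = process_chain(flow_a)
```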

The group extraction unit 12 extracts all groups using the information stored in the data flow storage unit 11, identifies metadata for each group, and stores it in the group storage unit 13. Metadata other than statistical differences and algorithms includes the description attached to the data flow from which the group was extracted, file names of data and processes, property information of data and processes, column names of input and output files, and process IDs. Alternatively, the group extraction unit 12 may accept a description of the group from the user and attach it as metadata. The group extraction unit 12 may identify multiple pieces of metadata for each group.

　グループ記憶部１３は、グループのメタデータを記憶する。図５は、グループ記憶部１３の一例を示す図である。図５は、メタデータが統計的な差異とアルゴリズムである場合を示す。図５に示すように、グループ記憶部１３は、グループを識別するグループＮｏ.に対応付けて、アルゴリズムと統計的な差異とを記憶する。例えば、グループ記憶部１３は、グループＡ１について、アルゴリズムとして「削除→補間」を記憶し、統計的な差異として「行数の減少→値の数の増加」を記憶する。 The group storage unit 13 stores the group metadata. FIG. 5 is a diagram showing an example of the group storage unit 13. FIG. 5 shows the case where the metadata is statistical differences and algorithms. As shown in FIG. 5, the group storage unit 13 stores the algorithm and the statistical difference in association with the group No. that identifies the group. For example, the group storage unit 13 stores "deletion-> interpolation" as an algorithm for group A1 and "decrease in the number of rows-> increase in the number of values" as a statistical difference.

　頻出度計算部１４は、グループの頻出度を計算し、グループの情報に対応付けてデータベース１５に格納する。頻出度計算部１４は、２つのグループでメタデータが類似すると、頻出度に１を加える。例えば、頻出度計算部１４は、メタデータごとに類似度を定義して、類似度が所定の閾値以上の場合に、メタデータが類似すると判定する。複数のメタデータを用いる場合には、頻出度計算部１４は、例えば、１つのメタデータが類似するごとに頻出度に１を加える。 The frequency calculation unit 14 calculates the frequency of the group and stores it in the database 15 in association with the group information. The frequency calculation unit 14 adds 1 to the frequency when the metadata is similar in the two groups. For example, the frequency calculation unit 14 defines the similarity for each metadata, and determines that the metadata is similar when the similarity is equal to or greater than a predetermined threshold value. When a plurality of metadata are used, the frequency calculation unit 14 adds 1 to the frequency every time one metadata is similar, for example.

　例えば、メタデータが統計的な差異とアルゴリズムである場合、頻出度計算部１４は、アルゴリズムと統計的な差異が同じ場合に、頻出度に１を加える。なお、アルゴリズムが「不明」である場合には、頻出度計算部１４は、対応するプロセスを実現するプログラムからライブラリ名を取得し、２つのグループで、ライブラリ名が一致する割合が０．８を超えていれば、「不明」に関してアルゴリズムが一致するとする。ここで、０．８は、閾値の例であり、他の値でもよい。 For example, when the metadata is a statistical difference and an algorithm, the frequency calculation unit 14 adds 1 to the frequency when the algorithm and the statistical difference are the same. When the algorithm is "unknown", the frequency calculation unit 14 obtains the library name from the program that realizes the corresponding process, and the ratio of matching library names in the two groups is 0.8. If it exceeds, the algorithm matches for "unknown". Here, 0.8 is an example of the threshold value, and other values may be used.
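A minimal sketch of this counting rule follows. It assumes that groups are compared pairwise across different data flows, that two groups match only when every step has the same algorithm and the same statistical difference, and that the library match ratio is the share of common names among all names; the class and function names and the contents of group B2 are illustrative.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, Tuple

LIBRARY_THRESHOLD = 0.8  # example value from the description


@dataclass(frozen=True)
class Step:
    algorithm: str    # e.g. "deletion", or "unknown"
    difference: str   # statistical difference the step produces
    libraries: FrozenSet[str] = frozenset()  # only used for "unknown" steps


@dataclass(frozen=True)
class Group:
    flow: str
    name: str
    steps: Tuple[Step, ...]


def steps_match(a: Step, b: Step) -> bool:
    if (a.algorithm, a.difference) != (b.algorithm, b.difference):
        return False
    if a.algorithm == "unknown":
        union = a.libraries | b.libraries
        # Assumed ratio: shared library names over all library names.
        return bool(union) and len(a.libraries & b.libraries) / len(union) > LIBRARY_THRESHOLD
    return True


def groups_match(a: Group, b: Group) -> bool:
    return len(a.steps) == len(b.steps) and all(map(steps_match, a.steps, b.steps))


def count_frequencies(groups) -> Counter:
    """Add 1 to a pattern's frequency for each matching pair from different flows."""
    frequency = Counter()
    for a, b in combinations(groups, 2):
        if a.flow != b.flow and groups_match(a, b):
            key = tuple((s.algorithm, s.difference) for s in a.steps)
            frequency[key] += 1
    return frequency


delete = Step("deletion", "decrease in the number of rows")
interpolate = Step("interpolation", "increase in the number of values")
normalize = Step("normalization", "change of value range")
groups = [
    Group("AA", "AA1", (delete, interpolate)),
    Group("B", "B1", (delete, interpolate)),
    Group("B", "B2", (delete, interpolate, normalize)),  # illustrative contents
    Group("DD", "DD1", (delete, interpolate)),
]
frequency = count_frequencies(groups)
```

With three matching groups (AA1, B1, DD1) there are three matching pairs, so the "deletion → interpolation" pattern ends with a frequency of 3 under these assumptions.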

　データベース１５は、グループの情報と頻出度を対応付けて記憶し、データフロー作成の際に参照される。図６は、データベース１５の一例を示す図である。図６は、メタデータが統計的な差異とアルゴリズムである場合を示す。図６に示すように、データベース１５は、アルゴリズムと、統計的な差異と、プロセス数と、頻出度と、グループ名と、プログラム名を対応付けて記憶する。 The database 15 stores group information in association with the frequency of occurrence and is referred to when creating a data flow. FIG. 6 is a diagram showing an example of the database 15. FIG. 6 shows the case where the metadata is statistical differences and algorithms. As shown in FIG. 6, the database 15 stores the algorithm, the statistical difference, the number of processes, the frequency of occurrence, the group name, and the program name in association with each other.

The number of processes is the number of processes included in the group. The group name identifies a group specified by an algorithm and statistical differences. When several groups share the same algorithm and statistical differences, several group names are recorded. The program name is associated with a group name and identifies the programs that implement the processes in the group identified by that group name.

For example, the number of processes in groups A1 and B1, which are specified by "decrease in the number of rows → increase in the number of values" and "deletion → interpolation", is "2", and their frequency is "1". The processing of group A1 is realized by executing "Python1" and "Python2" of data flow A in the order "Python1 → Python2".

　作成フロー記憶部１６は、ユーザが作成中のデータフローのグラフ構造の情報を記憶する。情報処理装置１０は、例えば、ユーザがマウスやキーボードを用いて作成中のデータフローのグラフ構造の情報を作成フロー記憶部１６に格納する。 The creation flow storage unit 16 stores information on the graph structure of the data flow being created by the user. The information processing device 10 stores, for example, information on the graph structure of the data flow being created by the user using a mouse or keyboard in the creation flow storage unit 16.

FIG. 7 shows an example of the creation flow storage unit 16. As shown in FIG. 7, the creation flow storage unit 16 stores No., a number that identifies an element of the graph structure of the data flow being created, in association with the graph structure of that element. Here, an element is the graph structure of one process together with its input data and output data. For example, the graph structure of the element whose number is "1" is "Data1.csv → Python1 → Data2.csv".

　グループ抽出部１２は、作成フロー記憶部１６が記憶する情報を用いて作成中のデータフローのメタデータを特定して、作成メタ情報記憶部１７に格納する。作成メタ情報記憶部１７は、作成中のデータフローのメタデータを記憶する。 The group extraction unit 12 identifies the metadata of the data flow being created using the information stored in the creation flow storage unit 16 and stores it in the creation meta information storage unit 17. The creation meta information storage unit 17 stores the metadata of the data flow being created.

Using the metadata stored in the creation meta information storage unit 17, the search unit 18 searches the database 15 for the group most similar to the data flow being created. The information processing apparatus 10 may search for similar groups instead of only the single most similar group.

For example, when the metadata consists of statistical differences and algorithms, the search unit 18 retrieves from the database 15, as the most similar group, the largest group that satisfies the following conditions.
・It has more processes than the data flow being created.
・Its frequency is at or above the threshold (for example, a threshold of 1).
・It includes the statistical differences and algorithms stored in the creation meta information storage unit 17.
・The match ratio of the names of the libraries imported by the Python program corresponding to "unknown" exceeds 0.8.
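The first three conditions can be sketched as a filter followed by a maximum over the number of processes. The record layout, the function names, and the requirement that the steps of the data flow being created appear in the group in the same order are assumptions made for illustration; the library-name condition for "unknown" steps is left out to keep the sketch short.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Step = Tuple[str, str]  # (algorithm, statistical difference)


def contains_in_order(candidate: Sequence[Step], query: Sequence[Step]) -> bool:
    """True if every query step appears in the candidate, in the same order."""
    remaining = iter(candidate)
    # `step in remaining` consumes the iterator up to the first match.
    return all(step in remaining for step in query)


def find_most_similar(database: List[Dict], query: Sequence[Step],
                      min_frequency: int = 1) -> Optional[Dict]:
    """Return the largest group that passes the filter, or None if none does."""
    candidates = [
        record for record in database
        if len(record["steps"]) > len(query)          # more processes than the flow
        and record["frequency"] >= min_frequency      # frequent enough
        and contains_in_order(record["steps"], query) # covers the flow's metadata
    ]
    return max(candidates, key=lambda record: len(record["steps"]), default=None)


DELETE = ("deletion", "decrease in the number of rows")
INTERP = ("interpolation", "increase in the number of values")
NORMALIZE = ("normalization", "change of value range")
UNKNOWN = ("unknown", "calculation of a new column")
GRAPH = ("graph display", "no output file")

# Three records modelled on groups named in the description.
database = [
    {"group": "DD7", "frequency": 1, "steps": [INTERP, NORMALIZE, UNKNOWN, GRAPH]},
    {"group": "DD4", "frequency": 1, "steps": [DELETE, INTERP, NORMALIZE, UNKNOWN, GRAPH]},
    {"group": "DD1", "frequency": 3, "steps": [DELETE, INTERP]},
]
query = [INTERP, UNKNOWN]  # data flow U
best = find_most_similar(database, query)
```

Here both DD7 and DD4 pass the filter, and DD4 is returned because it has the most processes.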

The search unit 18 may also identify similar groups by lowering the required library name match ratio below 0.8. Alternatively, the search unit 18 may treat as a similar group any group that contains the statistical differences and algorithms of the data flow being created except for a single algorithm. The search unit 18 identifies the processes that are absent from the data flow being created and the positions those processes would occupy in that data flow.

The display unit 19 outputs, as recommendation information, the program that implements each process and its input and output data at the position identified by the search unit 18, and displays it on a display device (not shown). The information processing apparatus 10 may also output the recommendation information to a printer via a printer output unit. FIGS. 1G and 2G show display examples produced by the display unit 19.

　次に、情報処理装置１０による処理のフローについて説明する。図８は、情報処理装置１０による処理のフローを示すフローチャートである。図８において、ステップＳ１～ステップＳ５は、データベース１５を作成する処理であり、ステップＳ６～ステップＳ９は、リコメンドすべき追加するプロセスを抽出する処理である。 Next, the processing flow by the information processing device 10 will be described. FIG. 8 is a flowchart showing a processing flow by the information processing apparatus 10. In FIG. 8, steps S1 to S5 are processes for creating the database 15, and steps S6 to S9 are processes for extracting additional processes to be recommended.

　図８に示すように、情報処理装置１０は、２つのデータフローの連続する部分をグルーピングする（ステップＳ１）。ここで、グループには、２つ以上のプロセスと２つ以上のプロセスの先頭のプロセスの入力データから最後のプロセスの出力データまでのデータとが含まれる。なお、最後のプロセスの出力データはない場合もある。 As shown in FIG. 8, the information processing apparatus 10 groups continuous portions of two data flows (step S1). Here, the group includes two or more processes and data from the input data of the first process of the two or more processes to the output data of the last process. Note that there may be no output data for the last process.
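The grouping in step S1 can be sketched as an enumeration of every contiguous run of two or more processes. The sketch assumes a linear data flow represented only by its ordered process names; the function name `extract_groups` and the five-process example flow are hypothetical.

```python
from typing import List, Tuple


def extract_groups(processes: List[str], min_size: int = 2) -> List[Tuple[str, ...]]:
    """Enumerate every contiguous run of at least `min_size` processes."""
    groups = []
    for start in range(len(processes)):
        for end in range(start + min_size, len(processes) + 1):
            groups.append(tuple(processes[start:end]))
    return groups


flow = ["Python1", "Python2", "Python3", "Python4", "Python5"]
groups = extract_groups(flow)
```

A flow of n processes yields n(n-1)/2 groups, so this five-process example yields ten. In this enumeration order the seventh group is Python2 → Python3 → Python4 → Python5, which is consistent with the program sequence the description records for group AA7.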

　そして、情報処理装置１０は、２つのグループのメタデータを特定する（ステップＳ２）。そして、情報処理装置１０は、２つのグループのメタデータが類似していれば、グループの類似度を＋１する（ステップＳ３）。 Then, the information processing device 10 identifies the metadata of the two groups (step S2). Then, if the metadata of the two groups are similar, the information processing apparatus 10 increments the similarity of the groups by +1 (step S3).

Then, if the group is not yet registered in the database 15, the information processing apparatus 10 registers the metadata and frequency in the database 15 (step S4). The information processing apparatus 10 then determines whether the frequency has been obtained for every combination of data flows and groupings (step S5); if any combination remains for which it has not been obtained, the process returns to step S1.

On the other hand, when the frequency has been obtained for every combination of data flows and groupings, the information processing apparatus 10 identifies metadata for the data flow being created (step S6). The information processing apparatus 10 then extracts from the database 15 the groups whose frequency is equal to or higher than a predetermined threshold and, from those, selects the groups that have at least one more process than the data flow being created (step S7).

Then, from the selected groups, the information processing apparatus 10 identifies the group whose metadata is most similar to that of the data flow being created (step S8). From the identified group, it identifies the processes that are absent from the data flow being created and their positions in that data flow, and displays the program that implements each process, with its input and output data, at the identified position (step S9).

　上述してきたように、実施例では、データベース１５が、グループの情報を記憶する。そして、検索部１８が、作成対象のデータフローと最も類似するグループをデータベース１５から検索する。そして、表示部１９が、検索部１８により検索されたグループから作成対象のデータフローと相違するプロセスを抽出して表示する。したがって、情報処理装置１０は、ユーザのデータフロー作成を支援することができる。 As described above, in the embodiment, the database 15 stores the group information. Then, the search unit 18 searches the database 15 for the group most similar to the data flow to be created. Then, the display unit 19 extracts and displays a process different from the data flow to be created from the group searched by the search unit 18. Therefore, the information processing device 10 can support the user's data flow creation.

　また、実施例では、データベース１５は、グループから特定されるメタデータを各グループについて記憶する。そして、検索部１８は、作成対象のデータフローよりプロセス数が多く、作成対象のデータフローとメタデータが最も類似するグループをデータベース１５から検索する。したがって、情報処理装置１０は、データフロー作成の参考となるグループを適切に検索することができる。 Further, in the embodiment, the database 15 stores the metadata specified from the groups for each group. Then, the search unit 18 searches the database 15 for a group having a larger number of processes than the data flow to be created and having the most similar metadata to the data flow to be created. Therefore, the information processing apparatus 10 can appropriately search for a group that serves as a reference for creating a data flow.

In the embodiment, the frequency calculation unit 14 also calculates the frequency of each group based on whether it is similar to groups of other data flows, and the database 15 stores the frequency in association with the group. The search unit 18 then retrieves from the database 15 the groups whose frequency is equal to or higher than a predetermined threshold. The information processing apparatus 10 can therefore retrieve frequently used groups as references for creating a data flow.

In the embodiment, the database 15 also stores statistical differences and algorithms as metadata. As the most similar group, the search unit 18 retrieves from the database 15 the group that has more processes than the data flow to be created, contains the statistical differences and algorithms of the data flow to be created, and has the largest number of processes. The information processing apparatus 10 can therefore appropriately retrieve the most similar group.

In the embodiment, when the algorithm of a process is "unknown", the search unit 18 determines that the algorithms match if the match ratio of the libraries imported by the programs that implement the process exceeds 0.8. The search unit 18 can therefore determine whether algorithms match even when the algorithm is unknown.

　なお、実施例では、情報処理装置１０について説明したが、情報処理装置１０が有する構成をソフトウェアによって実現することで、同様の機能を有するデータフロー作成プログラムを得ることができる。そこで、データフロー作成プログラムを実行するコンピュータについて説明する。 Although the information processing device 10 has been described in the embodiment, a data flow creation program having the same function can be obtained by realizing the configuration of the information processing device 10 by software. Therefore, a computer that executes the data flow creation program will be described.

　図９は、実施例に係るデータフロー作成プログラムを実行するコンピュータのハードウェア構成を示す図である。図９に示すように、コンピュータ５０は、メインメモリ５１と、ＣＰＵ（Central　Processing　Unit）５２と、ＬＡＮ（Local　Area　Network）インタフェース５３と、ＨＤＤ（Hard　Disk　Drive）５４とを有する。また、コンピュータ５０は、スーパーＩＯ（Input　Output）５５と、ＤＶＩ（Digital　Visual　Interface）５６と、ＯＤＤ（Optical　Disk　Drive）５７とを有する。 FIG. 9 is a diagram showing a hardware configuration of a computer that executes a data flow creation program according to an embodiment. As shown in FIG. 9, the computer 50 has a main memory 51, a CPU (Central Processing Unit) 52, a LAN (Local Area Network) interface 53, and an HDD (Hard Disk Drive) 54. Further, the computer 50 has a super IO (Input Output) 55, a DVI (Digital Visual Interface) 56, and an ODD (Optical Disk Drive) 57.

　メインメモリ５１は、プログラムやプログラムの実行途中結果等を記憶するメモリである。ＣＰＵ５２は、メインメモリ５１からプログラムを読み出して実行する中央処理装置である。ＣＰＵ５２は、メモリコントローラを有するチップセットを含む。 The main memory 51 is a memory for storing a program, a result during execution of the program, and the like. The CPU 52 is a central processing unit that reads a program from the main memory 51 and executes it. The CPU 52 includes a chipset having a memory controller.

　ＬＡＮインタフェース５３は、コンピュータ５０をＬＡＮ経由で他のコンピュータに接続するためのインタフェースである。ＨＤＤ５４は、プログラムやデータを格納するディスク装置であり、スーパーＩＯ５５は、マウスやキーボード等の入力装置を接続するためのインタフェースである。ＤＶＩ５６は、液晶表示装置を接続するインタフェースであり、ＯＤＤ５７は、ＤＶＤ、ＣＤ－Ｒの読み書きを行う装置である。 The LAN interface 53 is an interface for connecting the computer 50 to another computer via a LAN. The HDD 54 is a disk device for storing programs and data, and the super IO 55 is an interface for connecting an input device such as a mouse or a keyboard. The DVI 56 is an interface for connecting a liquid crystal display device, and the ODD 57 is a device for reading and writing DVDs and CD-Rs.

　ＬＡＮインタフェース５３は、ＰＣＩエクスプレス（ＰＣＩｅ）によりＣＰＵ５２に接続され、ＨＤＤ５４及びＯＤＤ５７は、ＳＡＴＡ（Serial　Advanced　Technology　Attachment）によりＣＰＵ５２に接続される。スーパーＩＯ５５は、ＬＰＣ（Low　Pin　Count）によりＣＰＵ５２に接続される。 The LAN interface 53 is connected to the CPU 52 by PCI Express (PCIe), and the HDD 54 and ODD 57 are connected to the CPU 52 by SATA (Serial Advanced Technology Attachment). The super IO 55 is connected to the CPU 52 by LPC (Low Pin Count).

　そして、コンピュータ５０において実行されるデータ処理プログラムは、コンピュータ５０により読み出し可能な記録媒体の一例であるＣＤ－Ｒに記憶され、ＯＤＤ５７によってＣＤ－Ｒから読み出されてコンピュータ５０にインストールされる。あるいは、データ処理プログラムは、ＬＡＮインタフェース５３を介して接続された他のコンピュータシステムのデータベース等に記憶され、これらのデータベースから読み出されてコンピュータ５０にインストールされる。そして、インストールされたデータ処理プログラムは、ＨＤＤ５４に記憶され、メインメモリ５１に読み出されてＣＰＵ５２によって実行される。 Then, the data processing program executed by the computer 50 is stored in the CD-R, which is an example of the recording medium readable by the computer 50, read from the CD-R by the ODD 57, and installed in the computer 50. Alternatively, the data processing program is stored in a database or the like of another computer system connected via the LAN interface 53, read from these databases, and installed in the computer 50. Then, the installed data processing program is stored in the HDD 54, read into the main memory 51, and executed by the CPU 52.

10 Information processing device
11 Data flow storage unit
12 Group extraction unit
13 Group storage unit
14 Frequency calculation unit
15 Database
16 Creation flow storage unit
17 Creation meta information storage unit
18 Search unit
19 Display unit
50 Computer
51 Main memory
52 CPU
53 LAN interface
54 HDD
55 Super IO
56 DVI
57 ODD

Claims

1. An information processing apparatus comprising:
a database that stores a series of data flows, each including, as elements, processes, data used by the processes, and data obtained as results of the processes;
an extraction unit that extracts, from the database, a data flow similar to a data flow to be created; and
an output unit that extracts, from the data flow extracted by the extraction unit, elements that differ from the data flow to be created, and outputs the extracted elements.
2. The information processing apparatus according to claim 1, wherein
the database stores, for each data flow, metadata identified from the data flow,
the extraction unit extracts from the database, as the data flow similar to the data flow to be created, a data flow that has more processes than the data flow to be created and whose metadata is similar to that of the data flow to be created, and
the output unit extracts, from the data flow extracted by the extraction unit, processes not included in the processes of the data flow to be created, and outputs the extracted processes.
3. The information processing apparatus according to claim 1 or 2, further comprising a calculation unit that calculates a frequency of occurrence based on whether a data flow is similar to other data flows, wherein
the database stores the frequency of occurrence in association with the data flow, and
the extraction unit extracts from the database a data flow whose frequency of occurrence is equal to or greater than a first threshold.
4. The information processing apparatus according to claim 2, wherein
the database stores a statistical difference and an algorithm as the metadata of a data flow, and
the extraction unit extracts from the database, as the data flow similar to the data flow to be created, a data flow that has more processes than the data flow to be created, includes the statistical difference and the algorithm of the data flow to be created, and has the largest number of processes.
5. The information processing apparatus according to claim 4, wherein
the database stores a program name that identifies a program implementing a process, and
the extraction unit, when there is a process whose algorithm is unknown, uses the program name to identify the names of the libraries imported by the program, and determines that the algorithms match when the proportion of matching names exceeds a second threshold.
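The library-name comparison recited in the preceding claim can be illustrated with a minimal sketch. It is not the applicant's implementation. It assumes the programs are Python sources, that the source text has already been looked up by program name, and that the "proportion of matching names" means shared names divided by all distinct names. The function names `imported_libraries` and `algorithms_match` are hypothetical.

```python
import ast


def imported_libraries(source: str) -> set[str]:
    """Return the top-level names of libraries imported by a Python source text."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports without a module name are skipped.
            names.add(node.module.split(".")[0])
    return names


def algorithms_match(source_a: str, source_b: str, second_threshold: float) -> bool:
    """Assumed rule: match when shared / all distinct library names exceeds the threshold."""
    libs_a = imported_libraries(source_a)
    libs_b = imported_libraries(source_b)
    union = libs_a | libs_b
    if not union:
        return False  # No imports at all: nothing to compare (an assumption).
    ratio = len(libs_a & libs_b) / len(union)
    return ratio > second_threshold


program_a = "import numpy as np\nfrom sklearn.linear_model import LinearRegression\nimport pandas\n"
program_b = "import numpy\nimport sklearn.svm\nimport os\n"
print(algorithms_match(program_a, program_b, second_threshold=0.4))
```

Here the two programs share `numpy` and `sklearn` out of four distinct libraries, so the ratio is 0.5, which exceeds the illustrative threshold of 0.4.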
6. A data flow creation program that causes a computer to execute operations comprising:
extracting, from a database that stores a series of data flows each including, as elements, processes, data used by the processes, and data obtained as results of the processes, a data flow similar to a data flow to be created; and
extracting, from the extracted data flow, elements that differ from the data flow to be created, and outputting the extracted elements.
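The operations recited in the claims (extract a similar data flow, then output the elements that differ) can be illustrated with a minimal sketch. It is not the applicant's implementation. The `DataFlow` record, its field names, and the functions `find_similar` and `differing_processes` are hypothetical. Metadata similarity is simplified to set inclusion, and the stored frequency of occurrence is assumed to be precomputed.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DataFlow:
    """Hypothetical record for one stored data flow (field names are assumptions)."""
    processes: tuple[str, ...]
    stat_differences: frozenset[str] = frozenset()
    algorithms: frozenset[str] = frozenset()
    frequency: int = 0


def find_similar(target: DataFlow, database: list[DataFlow],
                 first_threshold: int) -> DataFlow | None:
    """Pick the stored flow with the most processes among those that are frequent
    enough, longer than the target, and include the target's metadata."""
    candidates = [
        flow for flow in database
        if flow.frequency >= first_threshold
        and len(flow.processes) > len(target.processes)
        and target.stat_differences <= flow.stat_differences
        and target.algorithms <= flow.algorithms
    ]
    return max(candidates, key=lambda flow: len(flow.processes), default=None)


def differing_processes(target: DataFlow, similar: DataFlow) -> list[str]:
    """Return processes of the similar flow that the target does not contain."""
    known = set(target.processes)
    return [p for p in similar.processes if p not in known]


database = [
    DataFlow(("load", "clean", "train"),
             frozenset({"mean"}), frozenset({"regression"}), frequency=5),
    DataFlow(("load", "clean", "normalize", "train", "evaluate"),
             frozenset({"mean"}), frozenset({"regression"}), frequency=3),
    DataFlow(("load", "cluster", "plot", "report", "export", "archive"),
             frozenset({"variance"}), frozenset({"clustering"}), frequency=9),
]
target = DataFlow(("load", "clean"), frozenset({"mean"}), frozenset({"regression"}))

similar = find_similar(target, database, first_threshold=2)
if similar is not None:
    print(differing_processes(target, similar))  # ['normalize', 'train', 'evaluate']
```

The third stored flow is the longest but is excluded because its metadata does not include the target's. The second flow is chosen over the first because it has more processes.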