WO2021038835A1 - Information processing device, and data flow creating program - Google Patents

Publication number
WO2021038835A1
Authority
WO
WIPO (PCT)
Prior art keywords
data flow
information processing
group
processing apparatus
created
Application number
PCT/JP2019/034153
Other languages
French (fr)
Japanese (ja)
Inventor
貴之 北野
Original Assignee
富士通株式会社
Application filed by 富士通株式会社
Priority to PCT/JP2019/034153
Publication of WO2021038835A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/20: Software design
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/30: Creation or generation of source code

Definitions

  • the present invention relates to an information processing device and a data flow creation program.
  • operation command information is stored in the operation command information storage means, command information to be executed corresponding to the input operation information is searched from the operation command information storage means, and the searched command is executed.
  • search information is generated from the flow information stored in the flow information storage unit, and when a search request is received, the search information is searched by the search conditions included in the search request, and from the search information that matches the search conditions.
  • when creating a data flow, a data scientist refers to data flows created in the past, but it is difficult to find a data flow that can serve as a reference.
  • One aspect of the present invention is to enable the output of elements for recommendation that are useful for the data flow to be created.
  • the information processing device has a database, an extraction unit, and an output unit.
  • the database accumulates a series of data flows including processing, data used for processing, and data obtained as a result of processing as elements.
  • the extraction unit extracts a data flow similar to the data flow to be created from the database.
  • the output unit extracts an element different from the data flow to be created from the data flow extracted by the extraction unit, and outputs the extracted element.
  • the present invention can enable the output of elements for recommendation that are useful for the data flow to be created.
  • FIG. 1A is a diagram showing a plurality of data flows used for creating a database.
  • FIG. 1B is a diagram showing the first group combination.
  • FIG. 1C is a diagram showing a second group combination.
  • FIG. 1D is a diagram showing the 68th group combination.
  • FIG. 1E is a diagram showing the 117th (last) group combination.
  • FIG. 1F is a diagram showing a data flow being created.
  • FIG. 1G is a diagram showing a recommendation screen.
  • FIG. 2A is a diagram showing a plurality of data flows used for creating a database.
  • FIG. 2B is a diagram showing the first group combination.
  • FIG. 2C is a diagram showing a second group combination.
  • FIG. 2D is a diagram showing the 130th group combination.
  • FIG. 2E is a diagram showing the 298th (last) group combination.
  • FIG. 2F is a diagram showing a data flow being created.
  • FIG. 2G is a diagram showing a recommendation screen.
  • FIG. 3 is a diagram showing a functional configuration of the information processing apparatus according to the embodiment.
  • FIG. 4 is a diagram showing an example of a data flow storage unit.
  • FIG. 5 is a diagram showing an example of a group storage unit.
  • FIG. 6 is a diagram showing an example of a database.
  • FIG. 7 is a diagram showing an example of the creation flow storage unit.
  • FIG. 8 is a flowchart showing a processing flow by the information processing apparatus.
  • FIG. 9 is a diagram showing a hardware configuration of a computer that executes a data flow creation program according to an embodiment.
  • the elliptical icon represents the process and the card icon represents the data.
  • the data is a csv (comma-separated values) file.
  • "Python" is a programming language, and "Python" in the ellipse indicates that the process is realized by a program written in Python.
  • the information processing apparatus uses a plurality of data flows to specify the metadata and the frequency of each partial data flow that includes a plurality of processes, and stores the specified metadata and the specified frequency in a database in association with each other.
  • the metadata is information associated with the partial data flow, and the details of the metadata will be described later.
  • the frequency is a value indicating the frequency with which the partial data flow is used.
  • the information processing apparatus searches the database, using the metadata and the frequency, for the partial data flow that is most similar to the data flow being created and has a larger number of processes than the data flow being created, and extracts the elements to be recommended.
  • FIG. 1A is a diagram showing a plurality of data flows used for creating a database.
  • four data flows represented by data flow A to data flow D are used for creating a database.
  • the information processing apparatus according to the embodiment specifies "decrease in the number of rows" as a statistical difference between "Data2.csv" and "Data1.csv" in the data flow A. Further, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in the data flow A. Other statistical differences include "increase in the number of rows", "decrease in the number of values", "decrease in the range of values", "increase in the range of values", "decrease in value types", "increase in value types", "calculation of a new column", and so on. The information processing apparatus according to the embodiment identifies these statistical differences by comparing the input data and the output data.
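The comparison of input and output data described above can be sketched as follows. This is a minimal illustration, not the method of the application: the function names, the toy CSV contents, and the rule that value counts are compared only when the row count is unchanged and only over shared columns are all assumptions made for this example.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def count_values(rows, columns):
    """Count the non-empty cells found in the given columns."""
    return sum(1 for row in rows for col in columns if row.get(col) not in ("", None))


def statistical_differences(before, after):
    """Label how the output rows (after) differ from the input rows (before)."""
    diffs = []
    cols_before = set(before[0]) if before else set()
    cols_after = set(after[0]) if after else set()
    if len(after) != len(before):
        direction = "decrease" if len(after) < len(before) else "increase"
        diffs.append(f"{direction} in the number of rows")
    else:
        # Assumption: compare value counts only over columns present on both sides.
        shared = cols_before & cols_after
        n_before = count_values(before, shared)
        n_after = count_values(after, shared)
        if n_after != n_before:
            direction = "decrease" if n_after < n_before else "increase"
            diffs.append(f"{direction} in the number of values")
    if cols_after - cols_before:
        diffs.append("calculation of a new column")
    return diffs


# Hypothetical file contents standing in for Data1.csv .. Data4.csv.
data1 = load_rows("id,temp\n1,20\n2,\n,25\n4,22\n")
data2 = load_rows("id,temp\n1,20\n2,\n4,22\n")        # row without id deleted
data3 = load_rows("id,temp\n1,20\n2,21\n4,22\n")      # missing temp filled in
data4 = load_rows("id,temp,temp_f\n1,20,68\n2,21,69.8\n4,22,71.6\n")

print(statistical_differences(data1, data2))
print(statistical_differences(data2, data3))
print(statistical_differences(data3, data4))
```

Only three of the listed difference types are covered here; range and value-type changes would need per-column comparisons that the text does not detail.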
  • the information processing apparatus specifies "deletion" as the algorithm of the process "Python1" that produces the statistical difference "decrease in the number of rows" between "Data2.csv" and "Data1.csv".
  • the combination of statistical differences and algorithms is an example of metadata.
  • the identified algorithm is displayed below the process.
  • the information processing apparatus specifies "interpolation" as the algorithm of the process "Python2" that produces the statistical difference "increase in the number of values" between "Data3.csv" and "Data2.csv".
  • the information processing apparatus specifies "normalization" as the algorithm of the process "Python3" that produces a statistical difference between "Data4.csv" and "Data3.csv". Further, in the information processing apparatus according to the embodiment, the algorithm of the process "Python4" that produces a statistical difference between "Data5.csv" and "Data4.csv" is unknown, so the algorithm is set to "unknown". Further, the information processing apparatus according to the embodiment specifies "name identification" as another algorithm in the data flow B.
  • the information processing apparatus extracts, as groups, all partial data flows that include two or more processes together with the data from the input data of the first of those processes to the output data of the last. However, the last process may have no output data. Then, the information processing apparatus according to the embodiment identifies statistical differences and algorithms for two groups included in different data flows, and determines whether or not the corresponding statistical differences and the corresponding algorithms match.
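The group extraction described above can be sketched as an enumeration of every contiguous run of two or more processes. This is an illustrative sketch that assumes a linear data flow; the function name `extract_groups` and the numbering rule (by start position, then by length) are assumptions, chosen because they agree with the group numbers that appear in the text (A1 is Python1 -> Python2, A5 is Python2 -> Python3 -> Python4).

```python
def extract_groups(flow_name, processes):
    """Enumerate every contiguous run of two or more processes in a linear flow.

    Groups are numbered by start position and then by length (an assumption).
    """
    groups = {}
    number = 1
    for start in range(len(processes)):
        for end in range(start + 2, len(processes) + 1):
            groups[f"{flow_name}{number}"] = processes[start:end]
            number += 1
    return groups


# Data flow A consists of the four processes Python1 .. Python4.
groups_a = extract_groups("A", ["Python1", "Python2", "Python3", "Python4"])
for name, members in groups_a.items():
    print(name, " -> ".join(members))
```

A flow with n processes yields n(n-1)/2 groups under this scheme, so data flow A yields six.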
  • when they match, the information processing apparatus determines that the two groups are the same, and adds 1 to the frequency of the group. Then, the information processing apparatus according to the embodiment stores the matching statistical difference, algorithm, and number of processes, together with the Python program names of the two groups, in the database in association with the frequency. Further, the information processing apparatus according to the embodiment determines whether or not the two groups are the same for all combinations of the groups.
  • the information processing apparatus extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv” from the data flow A as a group A1.
  • group A1 is a group whose group number for identifying the group is "A1”.
  • the information processing apparatus according to the embodiment extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv” from the data flow B as a group B1.
  • the information processing apparatus specifies "decrease in the number of rows" as a statistical difference between "Data2.csv" and "Data1.csv" in group A1. Further, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group A1. Further, the information processing apparatus according to the embodiment specifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
  • the information processing apparatus specifies "decrease in the number of rows" as a statistical difference between "Data2.csv" and "Data1.csv" in group B1. Further, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group B1. Further, the information processing apparatus according to the embodiment specifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
  • the information processing apparatus extracts group A1 and group B2 as the second combination as shown in FIG. 1C. Since the number of processes is different between the group A1 and the group B2, the information processing apparatus according to the embodiment determines that they are different groups and extracts the next group. Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group A5 and group D2 as the 68th combination as shown in FIG. 1D.
  • the information processing apparatus specifies "increase in the number of values” as a statistical difference between “Data3.csv” and “Data2.csv” in group A5. Further, the information processing apparatus according to the embodiment specifies “change of value range” as a statistical difference between “Data4.csv” and “Data3.csv” in group A5. In addition, the information processing apparatus according to the embodiment specifies "calculation of a new column” as a statistical difference between "Data5.csv” and "Data4.csv” in group A5.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values". Further, the information processing apparatus according to the embodiment specifies "normalization" as the algorithm that produces the statistical difference "change of value range". Further, in the information processing apparatus according to the embodiment, since the algorithm that produces the statistical difference "calculation of a new column" is unknown, the algorithm is set to "unknown" and the library names imported by "Python4" are extracted.
  • the information processing apparatus specifies "increase in the number of values” as a statistical difference between “Data2.csv” and “Data1.csv” in group D2. Further, the information processing apparatus according to the embodiment specifies “change of value range” as a statistical difference between “Data3.csv” and “Data2.csv” in group D2. Further, the information processing apparatus according to the embodiment specifies “calculation of a new column” as a statistical difference between “Data4.csv” and "Data3.csv” in group D2.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values". Further, the information processing apparatus according to the embodiment specifies "normalization" as the algorithm that produces the statistical difference "change of value range". Further, in the information processing apparatus according to the embodiment, since the algorithm that produces the statistical difference "calculation of a new column" is unknown, the algorithm is set to "unknown" and the library names imported by "Python3" are extracted.
  • the information processing apparatus adds 1 to the frequency of the group whose algorithm is "interpolation -> normalization -> library name match ratio of more than 0.8" and whose statistical difference is "increase in the number of values -> change of value range -> calculation of a new column".
  • the information processing apparatus according to the embodiment saves in the database "interpolation -> normalization -> library name match ratio of more than 0.8" as the algorithm, "increase in the number of values -> change of value range -> calculation of a new column" as the statistical difference, and "3" as the number of processes.
  • the information processing apparatus stores "Python2-> Python3-> Python4" as the A5 Python program and "Python1-> Python2-> Python3" as the D2 Python program in the database. Further, the information processing apparatus according to the embodiment stores "1" as the frequency of occurrence in the database.
  • the information processing apparatus repeats the same determination, extracts group C3 and group D3 as the 117th combination (the last combination of the group round robin) as shown in FIG. 1E, and performs the same processing.
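The combination counts quoted in the text can be checked with a short enumeration. In this sketch the number of processes per data flow (4, 4, 3, 3 for A to D, and 5, 4, 3, 5 for AA, B, C, DD) is inferred from the highest group numbers mentioned, not stated directly, and the iteration order (pairs of flows in listed order, then groups of the first flow, then groups of the second) is likewise an assumption.

```python
from itertools import combinations, product


def group_names(flow_name, n_processes):
    """Names of all groups of two or more contiguous processes in one flow."""
    count = n_processes * (n_processes - 1) // 2
    return [f"{flow_name}{i}" for i in range(1, count + 1)]


def round_robin(flow_sizes):
    """Pair every group with every group that belongs to a different flow."""
    pairs = []
    for (name_x, n_x), (name_y, n_y) in combinations(flow_sizes.items(), 2):
        pairs.extend(product(group_names(name_x, n_x), group_names(name_y, n_y)))
    return pairs


# Process counts per flow are inferred from the group numbers in the text.
pairs_1 = round_robin({"A": 4, "B": 4, "C": 3, "D": 3})
pairs_2 = round_robin({"AA": 5, "B": 4, "C": 3, "DD": 5})

print(len(pairs_1), pairs_1[1], pairs_1[67], pairs_1[-1])
print(len(pairs_2), pairs_2[-1])
```

Under these assumptions the first example gives 117 combinations, with A1 and B2 second, A5 and D2 68th, and C3 and D3 last. The second example gives 298 combinations ending with C3 and DD10.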
  • the information processing apparatus specifies a frequency of 1 for each of group A1 and group B1, group A4 and group D1, group A6 and group D3, group B6 and group C3, and group A5 and group D2, and specifies 0 as the frequency of occurrence of the other groups.
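The matching and counting above can be sketched for data flows A and D, whose algorithms and statistical differences are spelled out in the text. The helper names, the dictionary layout, the library sets, and the use of an intersection-over-union ratio for library names are assumptions made for illustration; flows B and C are left out because their contents are not fully described.

```python
from collections import Counter
from itertools import product


def step(difference, algorithm, libraries=()):
    """One process, described by its metadata."""
    return {"difference": difference, "algorithm": algorithm, "libraries": set(libraries)}


def extract_groups(name, steps):
    """Every contiguous run of two or more processes, numbered from 1."""
    groups, number = {}, 1
    for start in range(len(steps)):
        for end in range(start + 2, len(steps) + 1):
            groups[f"{name}{number}"] = steps[start:end]
            number += 1
    return groups


def steps_match(x, y, threshold=0.8):
    """Two processes match if difference and algorithm agree; 'unknown' also needs similar imports."""
    if (x["difference"], x["algorithm"]) != (y["difference"], y["algorithm"]):
        return False
    if x["algorithm"] == "unknown":
        union = x["libraries"] | y["libraries"]
        ratio = len(x["libraries"] & y["libraries"]) / len(union) if union else 1.0
        return ratio > threshold
    return True


def groups_match(gx, gy):
    return len(gx) == len(gy) and all(steps_match(x, y) for x, y in zip(gx, gy))


libs = {"numpy", "pandas"}  # hypothetical imports of the "unknown" programs
flow_a = [
    step("decrease in the number of rows", "deletion"),
    step("increase in the number of values", "interpolation"),
    step("change of value range", "normalization"),
    step("calculation of a new column", "unknown", libs),
]
flow_d = flow_a[1:]  # data flow D: interpolation, normalization, unknown

groups_a, groups_d = extract_groups("A", flow_a), extract_groups("D", flow_d)
matches = []
frequency = Counter()
for (name_x, gx), (name_y, gy) in product(groups_a.items(), groups_d.items()):
    if groups_match(gx, gy):
        matches.append((name_x, name_y))
        frequency[tuple((s["difference"], s["algorithm"]) for s in gx)] += 1

print(matches)
```

With these inputs the matching pairs are A4 and D1, A5 and D2, and A6 and D3, which agrees with the pairs listed in the text for these two flows.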
  • FIG. 1F is a diagram showing a data flow being created.
  • the information processing apparatus according to the embodiment specifies "increase in the number of values” as a statistical difference between "Data2.csv” and “Data1.csv” in the data flow U shown in FIG. 1F. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column” as a statistical difference between "Data3.csv” and "Data2.csv” in the data flow U.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "unknown" as the algorithm that produces the statistical difference "calculation of a new column". Then, the names of the imported libraries are extracted from "Python2".
  • the number of processes in the data flow U is "2", the statistical differences are "increase in the number of values" and "calculation of a new column", and the algorithms are "interpolation" and "unknown". Therefore, the information processing apparatus according to the embodiment specifies the largest group that satisfies the following conditions among the groups stored in the database.
    - It has 3 or more processes.
    - Its frequency is equal to or greater than the threshold (for example, the threshold is 1).
    - Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
    - The match ratio of the library names imported by the "unknown" Python programs exceeds 0.8.
  • the information processing apparatus specifies group D2 as the largest group satisfying the above conditions, and identifies and recommends "Python2" as the program that realizes a process of D2 that is not in the data flow being created.
  • FIG. 1G is a diagram showing a recommendation screen. As shown in FIG. 1G, the information processing apparatus according to the embodiment recommends inserting "normalization” between “interpolation” and "unknown” of the data flow being created based on D2.
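The search and recommendation for data flow U can be sketched as follows. This is a simplified illustration: the record layout and function names are assumptions, matching is done on algorithm labels only, and the library-name check for "unknown" processes is omitted. The database contents are the subset of the first example whose algorithms are given in the text.

```python
def is_subsequence(short, long):
    """True if the items of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def find_recommendation(database, flow, threshold=1):
    """Return the largest qualifying record and the algorithms the flow lacks."""
    candidates = [
        record for record in database
        if record["processes"] > len(flow)        # larger than the flow being created
        and record["frequency"] >= threshold      # seen often enough
        and is_subsequence(flow, record["algorithms"])
    ]
    if not candidates:
        return None, []
    best = max(candidates, key=lambda record: record["processes"])
    missing = [algo for algo in best["algorithms"] if algo not in flow]
    return best, missing


# Subset of the first example's database (hypothetical record layout).
database = [
    {"groups": "A1/B1", "algorithms": ["deletion", "interpolation"], "processes": 2, "frequency": 1},
    {"groups": "A4/D1", "algorithms": ["interpolation", "normalization"], "processes": 2, "frequency": 1},
    {"groups": "A5/D2", "algorithms": ["interpolation", "normalization", "unknown"], "processes": 3, "frequency": 1},
    {"groups": "A6/D3", "algorithms": ["normalization", "unknown"], "processes": 2, "frequency": 1},
]

flow_u = ["interpolation", "unknown"]
best, missing = find_recommendation(database, flow_u)
print(best["groups"], missing)
```

For data flow U this selects the A5/D2 record and reports "normalization" as the missing process, which corresponds to the recommendation shown in FIG. 1G.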
  • the information processing apparatus displays the recommended process together with the input / output data, for example, with a green frame.
  • the information processing apparatus searches the database for a reference data flow and displays it, so that it is possible to support the creation of the data flow by the data scientist.
  • FIG. 2A is a diagram showing a plurality of data flows used for creating a database.
  • four data flows represented by data flow AA, data flow B, data flow C and data flow DD are used to create the database.
  • the information processing apparatus extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv” from the data flow AA as a group AA1. Further, the information processing apparatus according to the embodiment extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv” from the data flow B as a group B1.
  • the information processing apparatus specifies "decrease in the number of rows” as a statistical difference between "Data2.csv” and “Data1.csv” in group AA1.
  • the information processing apparatus specifies "increase in the number of values” as a statistical difference between "Data3.csv” and “Data2.csv” in group AA1.
  • the information processing apparatus specifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
  • the information processing apparatus specifies "decrease in the number of rows" as a statistical difference between "Data2.csv" and "Data1.csv" in group B1. Further, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group B1. Further, the information processing apparatus according to the embodiment specifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
  • the information processing apparatus extracts group AA1 and group B2 as the second combination as shown in FIG. 2C. Since the number of processes is different between the group AA1 and the group B2, the information processing apparatus according to the embodiment determines that they are different groups and extracts the next group. Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group AA7 and group DD7 as the 130th combination as shown in FIG. 2D.
  • the information processing apparatus specifies "increase in the number of values” as a statistical difference between “Data3.csv” and “Data2.csv” in group AA7. Further, the information processing apparatus according to the embodiment specifies “change of value range” as a statistical difference between “Data4.csv” and “Data3.csv” in group AA7. Further, the information processing apparatus according to the embodiment specifies “calculation of a new column” as a statistical difference between "Data5.csv” and “Data4.csv” in group AA7. Further, the information processing apparatus according to the embodiment specifies "no output file” as a statistical difference in the group AA7.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values". Further, the information processing apparatus according to the embodiment specifies "normalization" as the algorithm that produces the statistical difference "change of value range". Further, the information processing apparatus according to the embodiment sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown", and extracts the library names imported by "Python4". Further, the information processing apparatus according to the embodiment specifies "graph display" as the algorithm that produces "no output file".
  • the information processing apparatus specifies "increase in the number of values” as a statistical difference between “Data3.csv” and “Data2.csv” in group DD7. Further, the information processing apparatus according to the embodiment specifies “change of value range” as a statistical difference between “Data4.csv” and “Data3.csv” in the group DD7. Further, the information processing apparatus according to the embodiment specifies “calculation of a new column” as a statistical difference between "Data5.csv” and “Data4.csv” in the group DD7. Further, the information processing apparatus according to the embodiment specifies "no output file” as a statistical difference in the group DD7.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values". Further, the information processing apparatus according to the embodiment specifies "normalization" as the algorithm that produces the statistical difference "change of value range". Further, the information processing apparatus according to the embodiment sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown", and extracts the library names imported by "Python4". Further, the information processing apparatus according to the embodiment specifies "graph display" as the algorithm that produces "no output file".
  • the corresponding statistical differences "increase in the number of values", "change of value range", "calculation of a new column" and "no output file" are the same, and the corresponding algorithms "interpolation", "normalization" and "graph display" are the same. Further, since the algorithm that produces the statistical difference "calculation of a new column" is unknown, the information processing apparatus according to the embodiment determines whether or not the ratio of matching library names exceeds 0.8. Then, when the ratio of matching library names exceeds 0.8, the information processing apparatus according to the embodiment determines that the statistical difference "calculation of a new column" was produced in the same manner, and determines that the group AA7 and the group DD7 match.
  • the information processing apparatus adds 1 to the frequency of the group whose algorithm is "interpolation -> normalization -> library name match ratio exceeding 0.8 -> graph display" and whose statistical difference is "increase in the number of values -> change of value range -> calculation of a new column -> no output file". Then, the information processing apparatus according to the embodiment saves "interpolation -> normalization -> library name match ratio exceeding 0.8 -> graph display" as the algorithm and "increase in the number of values -> change of value range -> calculation of a new column -> no output file" as the statistical difference in the database.
  • the information processing apparatus stores "4" as the number of processes and "Python2 -> Python3 -> Python4 -> Python5" as the Python program of AA7 in the database. Further, the information processing apparatus according to the embodiment stores "Python2 -> Python3 -> Python4 -> Python5" as the Python program of DD7 and "1" as the frequency of occurrence in the database.
  • the information processing apparatus repeats the same determination, extracts group C3 and group DD10 as the 298th combination (the last combination of the group round robin) as shown in FIG. 2E, and performs the same processing.
  • the information processing apparatus specifies that the frequency of occurrence of group AA1, group B1 and group DD1 is 3. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA2 and group DD2, group AA3 and group DD3, group AA4 and group DD4, group AA5 and group DD5, and group AA6 and group DD6 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA7 and group DD7, group AA8 and group DD8, group AA9 and group DD9, group AA10 and group DD10, and group B6 and group C3 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of other groups is 0.
  • FIG. 2F is a diagram showing a data flow being created.
  • the information processing apparatus according to the embodiment specifies "increase in the number of values” as a statistical difference between "Data2.csv” and “Data1.csv” in the data flow U shown in FIG. 2F. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column” as a statistical difference between "Data3.csv” and "Data2.csv” in the data flow U.
  • the information processing apparatus specifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "unknown" as the algorithm that produces the statistical difference "calculation of a new column". Then, the information processing apparatus according to the embodiment extracts the library names imported by "Python2".
  • the number of processes in the data flow U is "2", the statistical differences are "increase in the number of values" and "calculation of a new column", and the algorithms are "interpolation" and "unknown". Therefore, the information processing apparatus according to the embodiment specifies the largest group that satisfies the following conditions among the groups stored in the database.
    - It has 3 or more processes.
    - Its frequency is equal to or greater than the threshold (for example, the threshold is 1).
    - Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
    - The match ratio of the library names imported by the "unknown" Python programs exceeds 0.8.
  • the information processing apparatus specifies the group DD4 as the largest group satisfying the above conditions. Then, the information processing apparatus according to the embodiment identifies and recommends "Python1", "Python3", and "Python5" as the programs that realize the processes of DD4 that are not in the data flow being created.
  • FIG. 2G is a diagram showing a recommendation screen. As shown in FIG. 2G, the information processing apparatus according to the embodiment recommends, based on DD4, inserting "deletion" before "interpolation", "normalization" between "interpolation" and "unknown", and "graph display" after "unknown" in the data flow being created.
  • the information processing apparatus can provide a plurality of options to the data scientist by recommending a plurality of processes.
  • FIG. 3 is a diagram showing a functional configuration of the information processing apparatus according to the embodiment.
  • the information processing apparatus 10 according to the embodiment includes a data flow storage unit 11, a group extraction unit 12, a group storage unit 13, a frequency calculation unit 14, and a database 15. Further, the information processing device 10 according to the embodiment includes a creation flow storage unit 16, a creation meta information storage unit 17, a search unit 18, and a display unit 19.
  • the data flow storage unit 11 stores information on the graph structure of a plurality of data flows.
  • the information processing device 10 receives an instruction given by the user using the mouse, reads out information on the graph structure of the data flow from the file, and stores or adds it to the data flow storage unit 11.
  • FIG. 4 is a diagram showing an example of the data flow storage unit 11.
  • the data flow storage unit 11 stores the data flow name that identifies the data flow and the information of the graph structure of the data flow in association with each other.
  • the data flow storage unit 11 stores, for example, "Data1.csv-> Python1-> Data2.csv” and “Data2.csv-> Python2-> Data3.csv” for the data flow A.
  • the data flow storage unit 11 stores "Data3.csv-> Python3-> Data4.csv” and "Data4.csv-> Python4-> Data5.csv” for the data flow A.
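The stored graph structure can be pictured as a list of "input -> process -> output" elements per data flow name. The sketch below is one possible in-memory representation; the function names and the dictionary layout are assumptions, and the optional output reflects the remark elsewhere in the text that the last process may have no output data.

```python
def parse_element(text):
    """Split 'input -> process -> output' into a tuple; output may be absent."""
    parts = [part.strip() for part in text.split("->")]
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return (parts[0], parts[1], None)
    raise ValueError(f"unexpected element: {text!r}")


def process_order(elements):
    """List the processes of a linear data flow in execution order."""
    return [process for _, process, _ in elements]


# Data flow A as described for the data flow storage unit 11.
data_flow_storage = {
    "A": [
        parse_element("Data1.csv-> Python1-> Data2.csv"),
        parse_element("Data2.csv-> Python2-> Data3.csv"),
        parse_element("Data3.csv-> Python3-> Data4.csv"),
        parse_element("Data4.csv-> Python4-> Data5.csv"),
    ]
}

print(data_flow_storage["A"][0])
print(process_order(data_flow_storage["A"]))
```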
  • the group extraction unit 12 extracts all the groups using the information stored in the data flow storage unit 11, specifies the metadata for each group, and stores it in the group storage unit 13. Metadata other than statistical differences and algorithms includes descriptive text added to the data flow from which the group was extracted, data and process file names, data and process property information, input/output file column names, process IDs, and the like.
  • the group extraction unit 12 may receive a description of the group from the user and add it as metadata.
  • the group extraction unit 12 may specify a plurality of metadata for each group.
  • the group storage unit 13 stores the group metadata.
  • FIG. 5 is a diagram showing an example of the group storage unit 13.
  • FIG. 5 shows the case where the metadata is statistical differences and algorithms.
  • the group storage unit 13 stores the algorithm and the statistical difference in association with the group No. that identifies the group.
  • the group storage unit 13 stores "deletion-> interpolation" as an algorithm for group A1 and "decrease in the number of rows-> increase in the number of values" as a statistical difference.
  • the frequency calculation unit 14 calculates the frequency of the group and stores it in the database 15 in association with the group information.
  • the frequency calculation unit 14 adds 1 to the frequency when the metadata is similar in the two groups.
  • the frequency calculation unit 14 defines the similarity for each metadata, and determines that the metadata is similar when the similarity is equal to or greater than a predetermined threshold value.
  • the frequency calculation unit 14 adds 1 to the frequency every time one metadata is similar, for example.
  • the frequency calculation unit 14 adds 1 to the frequency when the algorithm and the statistical difference are the same.
  • for an "unknown" algorithm, the frequency calculation unit 14 obtains the library names from the programs that realize the corresponding processes and, if the ratio of matching library names in the two groups exceeds 0.8, determines that the algorithms match.
  • 0.8 is an example of the threshold value, and other values may be used.
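The library-name comparison can be sketched with Python's standard `ast` module, which reads import statements without executing the program. The text does not define how the match ratio is computed, so the intersection-over-union ratio used here is an assumption, as are the function names and the sample programs.

```python
import ast


def imported_libraries(source):
    """Collect the top-level names of the libraries a Python program imports."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def library_match_ratio(source_x, source_y):
    """Shared libraries divided by all libraries (assumed definition of the ratio)."""
    libs_x, libs_y = imported_libraries(source_x), imported_libraries(source_y)
    union = libs_x | libs_y
    return len(libs_x & libs_y) / len(union) if union else 1.0


# Hypothetical programs standing in for two "unknown" processes.
program_x = "import numpy as np\nimport pandas as pd\nfrom sklearn.linear_model import LinearRegression\n"
program_y = "import pandas\nimport numpy\nimport sklearn.metrics\n"
program_z = "import numpy\nimport os\n"

THRESHOLD = 0.8  # example value from the text; other values may be used
print(library_match_ratio(program_x, program_y) > THRESHOLD)
print(library_match_ratio(program_x, program_z) > THRESHOLD)
```

Because only the syntax tree is inspected, the listed libraries do not need to be installed for the comparison to run.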
  • the database 15 stores group information in association with the frequency of occurrence and is referred to when creating a data flow.
  • FIG. 6 is a diagram showing an example of the database 15.
  • FIG. 6 shows the case where the metadata is statistical differences and algorithms.
  • the database 15 stores the algorithm, the statistical difference, the number of processes, the frequency of occurrence, the group name, and the program name in association with each other.
  • the number of processes is the number of processes included in the group.
  • the group name is a name that identifies a group specified by the algorithm and the statistical differences. If multiple groups share the same algorithm and statistical differences, multiple group names are stored.
  • the program name is the name of a program that realizes a process included in the group identified by the associated group name.
  • the number of processes in groups A1 and B1 specified by "decrease in the number of rows -> increase in the number of values" and "deletion -> interpolation" is "2", and the frequency is "1".
  • the processing of the group A1 is realized by executing "Python1" and "Python2" of the data flow A in the order of "Python1 -> Python2".
  • the creation flow storage unit 16 stores information on the graph structure of the data flow being created by the user.
  • the information processing device 10 stores, for example, information on the graph structure of the data flow being created by the user using a mouse or keyboard in the creation flow storage unit 16.
  • FIG. 7 is a diagram showing an example of the creation flow storage unit 16.
  • the creation flow storage unit 16 stores a number that identifies an element of the graph structure of the data flow being created in association with the graph structure of that element.
  • the element is a graph structure of one process and its input data and output data.
  • the graph structure of the element whose identification number is "1" is "Data1.csv -> Python1 -> Data2.csv".
  • the group extraction unit 12 identifies the metadata of the data flow being created using the information stored in the creation flow storage unit 16 and stores it in the creation meta information storage unit 17.
  • the creation meta information storage unit 17 stores the metadata of the data flow being created.
  • the search unit 18 searches the database 15 for the group most similar to the data flow being created, using the metadata stored in the creation meta information storage unit 17.
  • the information processing device 10 may search for a similar group instead of the group most similar to the data flow being created.
  • the search unit 18 searches the database 15 for the largest group that satisfies all of the following conditions as the most similar group: (1) it has more processes than the data flow being created; (2) its frequency is equal to or higher than a threshold (for example, 1); (3) it includes the statistical differences and algorithms stored in the creation meta information storage unit 17; and (4) the match ratio of the names of the libraries imported by the Python programs corresponding to "unknown" exceeds 0.8.
  • the search unit 18 may identify similar groups by lowering the required ratio of matching library names below 0.8.
  • the search unit 18 may identify, as a similar group, a group that includes the statistical differences and algorithms of the data flow being created except for one algorithm.
  • the search unit 18 identifies a process that is not in the data flow being created and the position of that process in the data flow being created.
  • the display unit 19 outputs the program that realizes the process and the input/output data as recommendation information at the position specified by the search unit 18, and displays them on a display device (not shown). Further, the information processing device 10 may output the recommendation information to the printer via the printer output unit. FIGS. 1G and 2G show display examples by the display unit 19.
  • FIG. 8 is a flowchart showing a processing flow by the information processing apparatus 10.
  • steps S1 to S5 are processes for creating the database 15.
  • steps S6 to S9 are processes for extracting additional processes to be recommended.
  • the information processing apparatus 10 groups continuous portions of two data flows (step S1).
  • the group includes two or more processes and data from the input data of the first process of the two or more processes to the output data of the last process. Note that there may be no output data for the last process.
  • the information processing device 10 identifies the metadata of the two groups (step S2). Then, if the metadata of the two groups are similar, the information processing apparatus 10 increments the frequency of the group by 1 (step S3).
  • if the group is not yet registered in the database 15, the information processing apparatus 10 registers the metadata and the frequency in the database 15 (step S4). Then, the information processing apparatus 10 determines whether the frequency has been obtained for all combinations of data flows and groupings (step S5), and if there is a combination for which it has not been obtained, returns to step S1.
  • the information processing apparatus 10 specifies metadata for the data flow being created (step S6). Then, the information processing apparatus 10 extracts groups having a frequency equal to or higher than a predetermined threshold value from the database 15, and selects, from the extracted groups, the groups that have at least one more process than the data flow being created (step S7).
  • the information processing apparatus 10 identifies, from the selected groups, the group whose metadata is most similar to the data flow being created (step S8). Then, from the identified group, the information processing apparatus 10 identifies a process that is not in the data flow being created and its position in that data flow, and displays the program and input/output data that realize the process at the identified position (step S9).
  • the database 15 stores the group information. Then, the search unit 18 searches the database 15 for the group most similar to the data flow to be created. Then, the display unit 19 extracts and displays a process different from the data flow to be created from the group searched by the search unit 18. Therefore, the information processing device 10 can support the user's data flow creation.
  • the database 15 stores the metadata specified from the groups for each group. Then, the search unit 18 searches the database 15 for a group having a larger number of processes than the data flow to be created and having the most similar metadata to the data flow to be created. Therefore, the information processing apparatus 10 can appropriately search for a group that serves as a reference for creating a data flow.
  • the frequency calculation unit 14 calculates the frequency of the group based on whether or not it is similar to the groups of other data flows, and the database 15 stores the frequency in association with the group. Then, the search unit 18 searches the database 15 for a group whose frequency is equal to or higher than a predetermined threshold value. Therefore, the information processing apparatus 10 can search for a group that is frequently used as a reference group for creating a data flow.
  • the database 15 stores statistical differences and algorithms as metadata.
  • the search unit 18 searches the database 15 for the group that has more processes than the data flow to be created, includes the statistical differences and algorithms of the data flow to be created, and has the largest number of processes. Therefore, the information processing apparatus 10 can appropriately search for the most similar group.
  • the search unit 18 determines that the algorithms match when the match ratio of the libraries imported by the programs realizing the processes exceeds 0.8. Therefore, even when the algorithm is unknown, the search unit 18 can determine whether or not the algorithms match.
  • FIG. 9 is a diagram showing a hardware configuration of a computer that executes a data flow creation program according to an embodiment.
  • the computer 50 has a main memory 51, a CPU (Central Processing Unit) 52, a LAN (Local Area Network) interface 53, and an HDD (Hard Disk Drive) 54. Further, the computer 50 has a super IO (Input Output) 55, a DVI (Digital Visual Interface) 56, and an ODD (Optical Disk Drive) 57.
  • the main memory 51 is a memory for storing a program, a result during execution of the program, and the like.
  • the CPU 52 is a central processing unit that reads a program from the main memory 51 and executes it.
  • the CPU 52 includes a chipset having a memory controller.
  • the LAN interface 53 is an interface for connecting the computer 50 to another computer via a LAN.
  • the HDD 54 is a disk device for storing programs and data.
  • the super IO 55 is an interface for connecting an input device such as a mouse or a keyboard.
  • the DVI 56 is an interface for connecting a liquid crystal display device.
  • the ODD 57 is a device for reading and writing DVDs and CD-Rs.
  • the LAN interface 53 is connected to the CPU 52 by PCI Express (PCIe), and the HDD 54 and ODD 57 are connected to the CPU 52 by SATA (Serial Advanced Technology Attachment).
  • the super IO 55 is connected to the CPU 52 by LPC (Low Pin Count).
  • the data processing program executed by the computer 50 is stored in the CD-R, which is an example of the recording medium readable by the computer 50, read from the CD-R by the ODD 57, and installed in the computer 50.
  • the data processing program is stored in a database or the like of another computer system connected via the LAN interface 53, read from these databases, and installed in the computer 50.
  • the installed data processing program is stored in the HDD 54, read into the main memory 51, and executed by the CPU 52.
  • 10 Information processing device; 11 Data flow storage unit; 12 Group extraction unit; 13 Group storage unit; 14 Frequency calculation unit; 15 Database; 16 Creation flow storage unit; 17 Creation meta information storage unit; 18 Search unit; 19 Display unit; 50 Computer; 51 Main memory; 52 CPU; 53 LAN interface; 54 HDD; 55 Super IO; 56 DVI; 57 ODD

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A database (15) stores information pertaining to groups. A group herein refers to a partial data flow comprising two or more processes and input data of the initial process of the two or more processes to output data of the last process thereof. The output data of the last process may be absent. A search unit (18) then retrieves from the database (15) groups most similar to a data flow to be created. Then, a display unit (19) extracts from the groups retrieved by the search unit (18) a process that differs from the data flow to be created, and causes the process to be displayed on a display device.

Description

Information processing device and data flow creation program
The present invention relates to an information processing device and a data flow creation program.
Companies are now actively promoting the use of data accumulated in the course of business. In putting data to use, data scientists perform analysis using data flows, which show the flow of data processing.
There is a conventional technique in which operation command information is accumulated in an operation command information storage means, the command information to be executed for input operation information is retrieved from the operation command information storage means, and the retrieved command is executed.
There is also a technique that speeds up the search for flow information by generating search information from the flow information stored in a flow information storage unit and, upon receiving a search request, searching the search information with the search conditions included in the request and acquiring the flow information from the search information that matches those conditions.
Japanese Unexamined Patent Publication No. 11-242600
Japanese Unexamined Patent Publication No. 2010-68279
When creating a data flow, a data scientist refers to data flows created in the past, but it is difficult to find a data flow that can serve as a reference.
In one aspect, an object of the present invention is to enable the output of recommendation elements that are useful for the data flow to be created.
In one aspect, the information processing device has a database, an extraction unit, and an output unit. The database accumulates a series of data flows whose elements include processes, the data used by the processes, and the data obtained as processing results. The extraction unit extracts from the database a data flow similar to the data flow to be created. The output unit extracts, from the data flow extracted by the extraction unit, an element that differs from the data flow to be created, and outputs the extracted element.
In one aspect, the present invention enables the output of recommendation elements that are useful for the data flow to be created.
FIG. 1A is a diagram showing a plurality of data flows used for creating a database.
FIG. 1B is a diagram showing the first group combination.
FIG. 1C is a diagram showing the second group combination.
FIG. 1D is a diagram showing the 68th group combination.
FIG. 1E is a diagram showing the 117th (last) group combination.
FIG. 1F is a diagram showing a data flow being created.
FIG. 1G is a diagram showing a recommendation screen.
FIG. 2A is a diagram showing a plurality of data flows used for creating a database.
FIG. 2B is a diagram showing the first group combination.
FIG. 2C is a diagram showing the second group combination.
FIG. 2D is a diagram showing the 130th group combination.
FIG. 2E is a diagram showing the 298th (last) group combination.
FIG. 2F is a diagram showing a data flow being created.
FIG. 2G is a diagram showing a recommendation screen.
FIG. 3 is a diagram showing a functional configuration of the information processing apparatus according to the embodiment.
FIG. 4 is a diagram showing an example of the data flow storage unit.
FIG. 5 is a diagram showing an example of the group storage unit.
FIG. 6 is a diagram showing an example of the database.
FIG. 7 is a diagram showing an example of the creation flow storage unit.
FIG. 8 is a flowchart showing a processing flow by the information processing apparatus.
FIG. 9 is a diagram showing a hardware configuration of a computer that executes a data flow creation program according to the embodiment.
Embodiments of the information processing apparatus and the data flow creation program disclosed in the present application are described in detail below with reference to the drawings. These embodiments do not limit the disclosed technology.
First, an example of the recommendation performed by the information processing apparatus according to the embodiment will be described with reference to FIGS. 1A to 1G. In FIGS. 1A to 1G, an elliptical icon represents a process and a card icon represents data. The data are CSV (comma-separated values) files. "Python" is a programming language, and "Python" inside an ellipse indicates that the process is realized by a program written in Python.
As shown in FIGS. 1A to 1E, the information processing apparatus according to the embodiment uses a plurality of data flows to identify the metadata and the frequency of partial data flows that each include a plurality of processes, and stores the identified metadata and frequency in a database. Here, metadata is information associated with a partial data flow; its details are described later. The frequency is a value indicating how often the partial data flow has been used.
Then, as shown in FIGS. 1F to 1G, the information processing apparatus according to the embodiment uses the metadata and the frequency to search the database for the partial data flow that is most similar to the data flow being created and has more processes than the data flow being created, and extracts the elements to be recommended.
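The search just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the `GroupRecord` type, the `recommend` function, and the sample records are hypothetical, "includes" is read as "contains the steps of the flow being created as a contiguous run", and the library-name check for "unknown" algorithms is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Step = Tuple[str, str]  # (algorithm, statistical difference) of one process


@dataclass(frozen=True)
class GroupRecord:
    """Hypothetical database row for one group."""
    steps: Tuple[Step, ...]
    programs: Tuple[str, ...]
    frequency: int


def find_offset(needle: Sequence[Step], haystack: Sequence[Step]) -> Optional[int]:
    """Index at which needle occurs contiguously in haystack, or None."""
    n = len(needle)
    for i in range(len(haystack) - n + 1):
        if tuple(haystack[i:i + n]) == tuple(needle):
            return i
    return None


def recommend(records, flow_steps, min_frequency=1) -> List[Tuple[int, str]]:
    """Return (position in group, program) for processes missing from the flow."""
    best, best_offset = None, None
    for rec in records:
        # The group needs more processes than the flow and a sufficient frequency.
        if len(rec.steps) <= len(flow_steps) or rec.frequency < min_frequency:
            continue
        offset = find_offset(flow_steps, rec.steps)
        if offset is None:
            continue
        # Keep the largest qualifying group.
        if best is None or len(rec.steps) > len(best.steps):
            best, best_offset = rec, offset
    if best is None:
        return []
    covered = range(best_offset, best_offset + len(flow_steps))
    return [(i, best.programs[i]) for i in range(len(best.steps)) if i not in covered]


# Illustrative records only; the values are assumptions, not the patent's figures.
records = [
    GroupRecord(
        (("deletion", "decrease in the number of rows"),
         ("interpolation", "increase in the number of values")),
        ("Python1", "Python2"), 1),
    GroupRecord(
        (("interpolation", "increase in the number of values"),
         ("normalization", "change in the range of values"),
         ("unknown", "calculation of a new column")),
        ("Python2", "Python3", "Python4"), 1),
]
flow = [("interpolation", "increase in the number of values"),
        ("normalization", "change in the range of values")]
print(recommend(records, flow))
```

With these sample records, only the three-process group qualifies, so the recommended element is its third process.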
FIG. 1A is a diagram showing a plurality of data flows used for creating the database. Here, four data flows, data flow A to data flow D, are used to create the database. In data flow A, the information processing apparatus according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". Other statistical differences include "increase in the number of rows", "decrease in the number of values", "decrease in the range of values", "increase in the range of values", "decrease in the types of values", "increase in the types of values", and "calculation of a new column". The information processing apparatus according to the embodiment identifies these statistical differences by comparing the input data with the output data.
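As a rough illustration of identifying a statistical difference by comparing input and output data, the sketch below classifies the change between two small tables. The function names, the treatment of empty strings as missing values, and the order in which the checks are applied are all assumptions made for this example; the text does not specify them.

```python
def count_values(rows):
    """Number of non-missing cells (empty strings and None count as missing)."""
    return sum(1 for row in rows for v in row.values() if v not in ("", None))


def column_names(rows):
    """Set of all column names appearing in the table."""
    names = set()
    for row in rows:
        names.update(row)
    return names


def statistical_difference(before, after):
    """Classify the change from the input table to the output table."""
    if len(after) < len(before):
        return "decrease in the number of rows"
    if len(after) > len(before):
        return "increase in the number of rows"
    if column_names(after) - column_names(before):
        return "calculation of a new column"
    n_before, n_after = count_values(before), count_values(after)
    if n_after > n_before:
        return "increase in the number of values"
    if n_after < n_before:
        return "decrease in the number of values"
    return "no difference detected"


# Toy stand-ins for Data1.csv, Data2.csv, and Data3.csv (contents are invented).
data1 = [{"t": "1", "x": "10"}, {"t": "", "x": "20"}, {"t": "3", "x": ""}]
data2 = [{"t": "1", "x": "10"}, {"t": "3", "x": ""}]    # row with missing t removed
data3 = [{"t": "1", "x": "10"}, {"t": "3", "x": "10"}]  # missing x filled in

print(statistical_difference(data1, data2))
print(statistical_difference(data2, data3))
```

Checking the row count first matters here because deleting rows also reduces the number of values; a different priority order would be equally defensible.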
Then, the information processing apparatus according to the embodiment identifies "deletion" as the algorithm of the process "Python1" that produces the statistical difference "decrease in the number of rows" between "Data2.csv" and "Data1.csv". The combination of a statistical difference and an algorithm is an example of metadata. The identified algorithm is displayed below the process. Besides "deletion", "outlier exclusion" is another algorithm that produces the statistical difference "decrease in the number of rows". Whether the algorithm is "deletion" or "outlier exclusion" is determined by comparing the input data with the output data. The apparatus also identifies "interpolation" as the algorithm of the process "Python2" that produces the statistical difference "increase in the number of values" between "Data3.csv" and "Data2.csv".
Similarly, the information processing apparatus according to the embodiment identifies "normalization" as the algorithm of the process "Python3" that produces the statistical difference between "Data4.csv" and "Data3.csv". Because the algorithm of the process "Python4" that produces the statistical difference between "Data5.csv" and "Data4.csv" cannot be determined, the apparatus sets that algorithm to "unknown". In data flow B, the apparatus also identifies "name identification" as another algorithm.
The information processing apparatus according to the embodiment extracts, as groups, all partial data flows from all the data flows, where each partial data flow includes two or more processes together with the data from the input data of the first of those processes to the output data of the last. The output data of the last process may be absent, however. Then, for two groups included in different data flows, the apparatus identifies the statistical differences and algorithms and determines whether the corresponding statistical differences and the corresponding algorithms match.
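The group extraction described above amounts to enumerating every contiguous run of two or more processes in a data flow. The sketch below assumes a linear data flow stored as an alternating list of data and process names; the function name `extract_groups` and the enumeration order (by starting process, then by length) are assumptions, chosen because they reproduce the groups A1 and A5 quoted in the text.

```python
from typing import List


def extract_groups(flow: List[str]) -> List[List[str]]:
    """Enumerate partial data flows with two or more processes.

    `flow` alternates data and process names: data, process, data, ...
    The output data of the last process may be missing.
    """
    n_processes = len(flow) // 2
    groups = []
    for start in range(n_processes):
        for end in range(start + 2, n_processes + 1):
            # From the input data of process `start` through the
            # output data of process `end - 1` (if present).
            groups.append(flow[2 * start: 2 * end + 1])
    return groups


flow_a = ["Data1.csv", "Python1", "Data2.csv", "Python2", "Data3.csv",
          "Python3", "Data4.csv", "Python4", "Data5.csv"]
for number, group in enumerate(extract_groups(flow_a), start=1):
    print(f"A{number}: " + " -> ".join(group))
```

A flow with n processes yields n(n-1)/2 groups under this scheme, so the four-process data flow A yields six groups.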
If the corresponding statistical differences and the corresponding algorithms match, the information processing apparatus according to the embodiment determines that the two groups are the same and adds 1 to the frequency of the group. It then stores the matched statistical differences, the algorithms, the number of processes, and the Python program names of the two groups in the database in association with the frequency. The apparatus makes this determination of whether two groups are the same for every combination of groups.
For example, as shown in FIG. 1B, the information processing apparatus according to the embodiment extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow A as group A1. Here, "group A1" is the group whose identifying group number is "A1". The apparatus also extracts "Data1.csv → Python1 → Data2.csv → Python2 → Data3.csv" from data flow B as group B1.
Then, in group A1, the information processing apparatus according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
Similarly, in group B1, the information processing apparatus according to the embodiment identifies "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
In group A1 and group B1, the corresponding statistical differences are the same, "decrease in the number of rows" and "increase in the number of values", and the corresponding algorithms are also the same, "deletion" and "interpolation". Therefore, the information processing apparatus according to the embodiment adds 1 to the frequency of the group whose algorithm is "deletion → interpolation" and whose statistical difference is "decrease in the number of rows → increase in the number of values". The apparatus then saves in the database "deletion → interpolation" as the algorithm, "decrease in the number of rows → increase in the number of values" as the statistical difference, "2" as the number of processes, and "1" as the frequency. It also saves "Python1 → Python2" as the Python program of group A1 and "Python1 → Python2" as the Python program of group B1.
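The database update for a matching pair such as A1 and B1 can be pictured with a small in-memory structure. This is only a sketch: the dictionary layout, the key made of (algorithms, statistical differences), and the function name `register_match` are assumptions for illustration, not the storage format of the embodiment.

```python
def register_match(database, algorithms, differences, programs_by_group):
    """Record that two groups with identical metadata were found.

    The entry is keyed by (algorithms, statistical differences); the
    frequency is incremented by 1 for each matching pair of groups.
    """
    key = (tuple(algorithms), tuple(differences))
    entry = database.setdefault(
        key, {"processes": len(algorithms), "frequency": 0, "programs": {}}
    )
    entry["frequency"] += 1
    entry["programs"].update(programs_by_group)
    return entry


database = {}
entry = register_match(
    database,
    algorithms=["deletion", "interpolation"],
    differences=["decrease in the number of rows", "increase in the number of values"],
    programs_by_group={"A1": ["Python1", "Python2"], "B1": ["Python1", "Python2"]},
)
print(entry)
```

Keying on the metadata rather than on the group name lets several groups from different data flows share one entry, which matches the way group names and program names are listed per entry in the text.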
Next, as the second combination, the information processing apparatus according to the embodiment extracts group A1 and group B2 as shown in FIG. 1C. Because group A1 and group B2 have different numbers of processes, the apparatus determines that they are different groups and extracts the next combination. The apparatus repeats the same determination and, as the 68th combination, extracts group A5 and group D2 as shown in FIG. 1D.
Then, in group A5, the information processing apparatus according to the embodiment identifies "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv", "change in the range of values" as the statistical difference between "Data4.csv" and "Data3.csv", and "calculation of a new column" as the statistical difference between "Data5.csv" and "Data4.csv".
The information processing apparatus according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change in the range of values". Because the algorithm that produces the statistical difference "calculation of a new column" cannot be determined, the apparatus sets that algorithm to "unknown" and extracts the names of the libraries imported by "Python4".
Similarly, in group D2, the information processing apparatus according to the embodiment identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", "change in the range of values" as the statistical difference between "Data3.csv" and "Data2.csv", and "calculation of a new column" as the statistical difference between "Data4.csv" and "Data3.csv".
The information processing apparatus according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change in the range of values". Because the algorithm that produces the statistical difference "calculation of a new column" cannot be determined, the apparatus sets that algorithm to "unknown" and extracts the names of the libraries imported by "Python3".
In group A5 and group D2, the corresponding statistical differences are the same, "increase in the number of values", "change in the range of values", and "calculation of a new column", and the corresponding algorithms "interpolation" and "normalization" are also the same. Because the algorithm that produces the statistical difference "calculation of a new column" is "unknown", the information processing apparatus according to the embodiment determines whether the ratio of matching library names exceeds 0.8. If it does, the apparatus determines that the statistical difference "calculation of a new column" was produced in the same way and that group A5 and group D2 match.
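The library-name comparison used for "unknown" algorithms might look like the sketch below. The text does not define how the match ratio is computed, so the ratio here (shared library names divided by all distinct library names, i.e. a Jaccard index) is an assumption, as are the function names and the sample program sources. Import statements are read with Python's standard `ast` module without executing the programs.

```python
import ast


def imported_libraries(source: str) -> set:
    """Top-level library names imported by a Python program."""
    libraries = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            libraries.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            libraries.add(node.module.split(".")[0])
    return libraries


def library_match_ratio(source_a: str, source_b: str) -> float:
    """Assumed definition: shared names / all distinct names."""
    a, b = imported_libraries(source_a), imported_libraries(source_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def unknown_algorithms_match(source_a: str, source_b: str, threshold: float = 0.8) -> bool:
    """Treat two 'unknown' algorithms as matching if the ratio exceeds the threshold."""
    return library_match_ratio(source_a, source_b) > threshold


# Invented program sources standing in for two "unknown" processes.
program_a = "import numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import StandardScaler\n"
program_b = "import pandas\nimport numpy\nimport sklearn.cluster\n"
program_c = "import numpy\nimport os\n"

print(unknown_algorithms_match(program_a, program_b))
print(unknown_algorithms_match(program_a, program_c))
```

The threshold of 0.8 follows the example value in the text; as noted there, other values may be used.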
 したがって、実施例に係る情報処理装置は、アルゴリズムが「補間→正規化→ライブラリ名の一致割合0.8超」であり、統計的な差異が「値の数の増加→値の範囲の変更→新しい列の算出」で表されるグループの頻出度に1を加える。そして、実施例に係る情報処理装置は、アルゴリズムとして「補間→正規化→ライブラリ名の一致割合0.8超」を、統計的な差異として「値の数の増加→値の範囲の変更→新しい列の算出」を、プロセス数として「3」を、データベースに保存する。また、実施例に係る情報処理装置は、A5のPythonプログラムとして「Python2→Python3→Python4」を、D2のPythonプログラムとして「Python1→Python2→Python3」を、データベースに保存する。また、実施例に係る情報処理装置は、頻出度として「1」をデータベースに保存する。 Therefore, in the information processing apparatus according to the embodiment, the algorithm is "interpolation-> normalization-> library name match ratio of more than 0.8", and the statistical difference is "increase in the number of values-> change the range of values->". Add 1 to the frequency of the group represented by "Calculation of new column". Then, the information processing apparatus according to the embodiment uses "interpolation-> normalization-> library name match ratio of more than 0.8" as an algorithm and "increases the number of values-> changes the range of values-> new" as a statistical difference. "Calculate column" and "3" as the number of processes are saved in the database. Further, the information processing apparatus according to the embodiment stores "Python2-> Python3-> Python4" as the A5 Python program and "Python1-> Python2-> Python3" as the D2 Python program in the database. Further, the information processing apparatus according to the embodiment stores "1" as the frequency of occurrence in the database.
 そして、実施例に係る情報処理装置は、同様の判定を繰り返していき、117個目の組み合わせ(グループの総当たりの最後の組み合わせ)として、図1Eに示すように、グループC3とグループD3を抽出し、同様の処理を行う。 Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group C3 and group D3 as the 117th combination (the last combination of group round robin) as shown in FIG. 1E. Then, perform the same processing.
 As a result of the above processing, the information processing apparatus according to the embodiment identifies a frequency of 1 for each of the pairs group A1 and group B1, group A4 and group D1, group A6 and group D3, group B6 and group C3, and group A5 and group D2, and a frequency of 0 for the other groups.
 The information processing apparatus according to the embodiment then refers to the database and makes a recommendation for the data flow being created. FIG. 1F is a diagram showing the data flow being created. In the data flow U shown in FIG. 1F, the apparatus identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", and "calculation of a new column" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown", and extracts the names of the libraries imported by "Python2".
 The number of processes in the data flow U is "2", its statistical differences are "increase in the number of values" and "calculation of a new column", and its algorithms are "interpolation" and "unknown". The information processing apparatus according to the embodiment therefore identifies, among the groups stored in the database, the largest group that satisfies the following conditions.
 ・It has three or more processes.
 ・Its frequency is equal to or greater than a threshold (for example, a threshold of 1).
 ・Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
 ・The match ratio of the names of the libraries imported by the "unknown" Python programs exceeds 0.8.
 The information processing apparatus according to the embodiment identifies group D2 as the largest group satisfying the above conditions, identifies "Python2" as the program that implements the process of D2 that is missing from the data flow being created, and recommends it. FIG. 1G is a diagram showing the recommendation screen. As shown in FIG. 1G, based on D2, the apparatus recommends inserting "normalization" between "interpolation" and "unknown" in the data flow being created. The apparatus displays the recommended process together with its input and output data, for example, surrounded by a green frame.
 このように、実施例に係る情報処理装置は、参考となるデータフローをデータベースから検索して表示するので、データサイエンティストによるデータフローの作成を支援することができる。 In this way, the information processing apparatus according to the embodiment searches the database for a reference data flow and displays it, so that it is possible to support the creation of the data flow by the data scientist.
 Next, another recommendation example will be described with reference to FIGS. 2A to 2G. FIG. 2A is a diagram showing a plurality of data flows used for creating the database. In this example, four data flows, namely data flow AA, data flow B, data flow C, and data flow DD, are used to create the database.
 実施例に係る情報処理装置は、図2Bに示すように、データフローAAから、「Data1.csv→Python1→Data2.csv→Python2→Data3.csv」をグループAA1として抽出する。また、実施例に係る情報処理装置は、データフローBから、「Data1.csv→Python1→Data2.csv→Python2→Data3.csv」をグループB1として抽出する。 As shown in FIG. 2B, the information processing apparatus according to the embodiment extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv" from the data flow AA as a group AA1. Further, the information processing apparatus according to the embodiment extracts "Data1.csv-> Python1-> Data2.csv-> Python2-> Data3.csv" from the data flow B as a group B1.
 The information processing apparatus according to the embodiment then identifies, in group AA1, "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
 Similarly, the information processing apparatus according to the embodiment identifies, in group B1, "decrease in the number of rows" as the statistical difference between "Data2.csv" and "Data1.csv", and "increase in the number of values" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "deletion" as the algorithm that produces the statistical difference "decrease in the number of rows", and "interpolation" as the algorithm that produces the statistical difference "increase in the number of values".
 In group AA1 and group B1, the corresponding statistical differences are the same ("decrease in the number of rows" and "increase in the number of values"), and the corresponding algorithms are also the same ("deletion" and "interpolation"). Therefore, the information processing apparatus according to the embodiment adds 1 to the frequency of the group whose algorithm is "deletion → interpolation" and whose statistical difference is "decrease in the number of rows → increase in the number of values". The apparatus then stores in the database "deletion → interpolation" as the algorithm, "decrease in the number of rows → increase in the number of values" as the statistical difference, "2" as the number of processes, and "1" as the frequency. It also stores "Python1 → Python2" as the Python program of group AA1 and "Python1 → Python2" as the Python program of group B1.
 実施例に係る情報処理装置は、次に、2個目の組み合わせとして、図2Cに示すように、グループAA1とグループB2を抽出する。グループAA1とグループB2は、プロセスの数が異なるため、実施例に係る情報処理装置は、異なるグループと判定して、次のグループを抽出する。そして、実施例に係る情報処理装置は、同様の判定を繰り返していき、130個目の組み合わせとして、図2Dに示すように、グループAA7とグループDD7を抽出する。 Next, the information processing apparatus according to the embodiment extracts group AA1 and group B2 as the second combination as shown in FIG. 2C. Since the number of processes is different between the group AA1 and the group B2, the information processing apparatus according to the embodiment determines that they are different groups and extracts the next group. Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group AA7 and group DD7 as the 130th combination as shown in FIG. 2D.
 そして、実施例に係る情報処理装置は、グループAA7において、「Data3.csv」と「Data2.csv」の統計的な差異として「値の数の増加」を特定する。また、実施例に係る情報処理装置は、グループAA7において、「Data4.csv」と「Data3.csv」の統計的な差異として「値の範囲の変更」を特定する。また、実施例に係る情報処理装置は、グループAA7において、「Data5.csv」と「Data4.csv」の統計的な差異として「新しい列の算出」を特定する。また、実施例に係る情報処理装置は、グループAA7において、統計的な差異として「出力ファイルなし」を特定する。 Then, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "change of value range" as a statistical difference between "Data4.csv" and "Data3.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column" as a statistical difference between "Data5.csv" and "Data4.csv" in group AA7. Further, the information processing apparatus according to the embodiment specifies "no output file" as a statistical difference in the group AA7.
 The information processing apparatus according to the embodiment also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change of value range". It sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown" and extracts the names of the libraries imported by "Python4". It further identifies "graph display" as the algorithm that produces "no output file".
 同様に、実施例に係る情報処理装置は、グループDD7において、「Data3.csv」と「Data2.csv」の統計的な差異として「値の数の増加」を特定する。また、実施例に係る情報処理装置は、グループDD7において、「Data4.csv」と「Data3.csv」の統計的な差異として「値の範囲の変更」を特定する。また、実施例に係る情報処理装置は、グループDD7において、「Data5.csv」と「Data4.csv」の統計的な差異として「新しい列の算出」を特定する。また、実施例に係る情報処理装置は、グループDD7において、統計的な差異として「出力ファイルなし」を特定する。 Similarly, the information processing apparatus according to the embodiment specifies "increase in the number of values" as a statistical difference between "Data3.csv" and "Data2.csv" in group DD7. Further, the information processing apparatus according to the embodiment specifies "change of value range" as a statistical difference between "Data4.csv" and "Data3.csv" in the group DD7. Further, the information processing apparatus according to the embodiment specifies "calculation of a new column" as a statistical difference between "Data5.csv" and "Data4.csv" in the group DD7. Further, the information processing apparatus according to the embodiment specifies "no output file" as a statistical difference in the group DD7.
 For group DD7 as well, the information processing apparatus according to the embodiment identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "normalization" as the algorithm that produces the statistical difference "change of value range". It sets the algorithm that produces the statistical difference "calculation of a new column" to "unknown" and extracts the names of the libraries imported by "Python4". It further identifies "graph display" as the algorithm that produces "no output file".
 In group AA7 and group DD7, the corresponding statistical differences are the same ("increase in the number of values", "change of value range", "calculation of a new column", and "no output file"), and the corresponding algorithms "interpolation", "normalization", and "graph display" are also the same. Because the algorithm that produces the statistical difference "calculation of a new column" is unknown, the information processing apparatus according to the embodiment determines whether the ratio of matching library names exceeds 0.8. If it does, the apparatus determines that the statistical difference "calculation of a new column" was produced in the same way, and therefore that group AA7 and group DD7 match.
 Therefore, the information processing apparatus according to the embodiment adds 1 to the frequency of the group whose algorithm is "interpolation → normalization → library name match ratio over 0.8 → graph display" and whose statistical difference is "increase in the number of values → change of value range → calculation of a new column → no output file". The apparatus then stores in the database "interpolation → normalization → library name match ratio over 0.8 → graph display" as the algorithm and "increase in the number of values → change of value range → calculation of a new column → no output file" as the statistical difference. It also stores "4" as the number of processes, "Python2 → Python3 → Python4 → Python5" as the Python program of AA7, "Python2 → Python3 → Python4 → Python5" as the Python program of DD7, and "1" as the frequency.
 そして、実施例に係る情報処理装置は、同様の判定を繰り返していき、298個目の組み合わせ(グループの総当たりの最後の組み合わせ)として、図2Eに示すように、グループC3とグループDD10を抽出し、同様の処理を行う。 Then, the information processing apparatus according to the embodiment repeats the same determination, and extracts group C3 and group DD10 as the 298th combination (the last combination of group round robin) as shown in FIG. 2E. Then, perform the same processing.
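 The combination counts quoted in the two examples (117 and 298) can be reproduced with a short sketch. It assumes that each data flow is a linear chain, that a group is any run of two or more consecutive processes numbered by start position and then by length, and that only groups from different data flows are paired. These assumptions are inferred from the group names and counts in the text rather than stated in it, and the process counts for data flows B and C are inferred from the names "B6" and "C3". The helper names are hypothetical.

```python
from itertools import combinations


def groups(flow_name, programs):
    """Enumerate runs of two or more consecutive processes of one data flow."""
    result = {}
    index = 1
    for start in range(len(programs) - 1):
        for end in range(start + 2, len(programs) + 1):
            result[f"{flow_name}{index}"] = programs[start:end]
            index += 1
    return result


def cross_flow_pairs(flows):
    """Pair every group with every group that comes from a different data flow."""
    all_groups = [
        (flow_name, group_name)
        for flow_name, programs in flows.items()
        for group_name in groups(flow_name, programs)
    ]
    return [
        (g1, g2)
        for (f1, g1), (f2, g2) in combinations(all_groups, 2)
        if f1 != f2
    ]


def chain(n):
    """Program names of a linear data flow with n processes."""
    return [f"Python{i}" for i in range(1, n + 1)]


first = {"A": chain(4), "B": chain(4), "C": chain(3), "D": chain(3)}
second = {"AA": chain(5), "B": chain(4), "C": chain(3), "DD": chain(5)}

print(len(cross_flow_pairs(first)))    # 117
print(len(cross_flow_pairs(second)))   # 298
print(cross_flow_pairs(second)[129])   # ('AA7', 'DD7'), the 130th combination
print(groups("DD", chain(5))["DD7"])   # ['Python2', 'Python3', 'Python4', 'Python5']
```

 Under these assumptions the enumeration order also agrees with the text: the first combination is AA1 and B1, the 130th is AA7 and DD7, and the last is C3 and DD10.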
 以上の処理の結果として、実施例に係る情報処理装置は、グループAA1とグループB1とグループDD1の頻出度を3と特定する。また、実施例に係る情報処理装置は、グループAA2とグループDD2、グループAA3とグループDD3、グループAA4とグループDD4、グループAA5とグループDD5、グループAA6とグループDD6の頻出度を1と特定する。また、実施例に係る情報処理装置は、グループAA7とグループDD7、グループAA8とグループDD8、グループAA9とグループDD9、グループAA10とグループDD10、グループB6とグループC3の頻出度を1と特定する。また、実施例に係る情報処理装置は、他のグループの頻出度を0と特定する。 As a result of the above processing, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA1, group B1 and group DD1 is 3. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA2 and group DD2, group AA3 and group DD3, group AA4 and group DD4, group AA5 and group DD5, and group AA6 and group DD6 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of group AA7 and group DD7, group AA8 and group DD8, group AA9 and group DD9, group AA10 and group DD10, and group B6 and group C3 is 1. Further, the information processing apparatus according to the embodiment specifies that the frequency of occurrence of other groups is 0.
 The information processing apparatus according to the embodiment then refers to the database and makes a recommendation for the data flow being created. FIG. 2F is a diagram showing the data flow being created. In the data flow U shown in FIG. 2F, the apparatus identifies "increase in the number of values" as the statistical difference between "Data2.csv" and "Data1.csv", and "calculation of a new column" as the statistical difference between "Data3.csv" and "Data2.csv". It also identifies "interpolation" as the algorithm that produces the statistical difference "increase in the number of values", and "unknown" as the algorithm that produces the statistical difference "calculation of a new column". The apparatus then extracts the names of the libraries imported by "Python2".
 The number of processes in the data flow U is "2", its statistical differences are "increase in the number of values" and "calculation of a new column", and its algorithms are "interpolation" and "unknown". The information processing apparatus according to the embodiment therefore identifies, among the groups stored in the database, the largest group that satisfies the following conditions.
 ・It has three or more processes.
 ・Its frequency is equal to or greater than a threshold (for example, a threshold of 1).
 ・Its statistical differences include "increase in the number of values" and "calculation of a new column", and its algorithms include "interpolation" and "unknown".
 ・The match ratio of the names of the libraries imported by the "unknown" Python programs exceeds 0.8.
 The information processing apparatus according to the embodiment identifies group DD4 as the largest group satisfying the above conditions. The apparatus then identifies "Python1", "Python3", and "Python5" as the programs that implement the processes of DD4 that are missing from the data flow being created, and recommends them. FIG. 2G is a diagram showing the recommendation screen. As shown in FIG. 2G, based on DD4, the apparatus recommends inserting "deletion" before "interpolation", "normalization" between "interpolation" and "unknown", and "graph display" after "unknown" in the data flow being created.
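 How the missing processes and their positions are derived is not spelled out here. As a minimal sketch, assuming the algorithms of the data flow being created appear in the same order inside the selected group, a single pass over the group is enough. The function name `recommend_insertions` and the position convention (index of the flow process before which to insert, with the flow length meaning "append at the end") are assumptions made for illustration.

```python
def recommend_insertions(flow_algorithms, group_algorithms, group_programs):
    """List (position, algorithm, program) for group processes missing from the flow."""
    recommendations = []
    matched = 0  # number of flow processes already found in the group
    for algorithm, program in zip(group_algorithms, group_programs):
        if matched < len(flow_algorithms) and algorithm == flow_algorithms[matched]:
            matched += 1
        else:
            recommendations.append((matched, algorithm, program))
    if matched != len(flow_algorithms):
        raise ValueError("the group does not contain the flow's algorithms in order")
    return recommendations


flow_u = ["interpolation", "unknown"]

# Group DD4 of the second example
dd4_algorithms = ["deletion", "interpolation", "normalization", "unknown", "graph display"]
dd4_programs = ["Python1", "Python2", "Python3", "Python4", "Python5"]
print(recommend_insertions(flow_u, dd4_algorithms, dd4_programs))
# [(0, 'deletion', 'Python1'), (1, 'normalization', 'Python3'), (2, 'graph display', 'Python5')]

# Group D2 of the first example
d2_algorithms = ["interpolation", "normalization", "unknown"]
d2_programs = ["Python1", "Python2", "Python3"]
print(recommend_insertions(flow_u, d2_algorithms, d2_programs))
# [(1, 'normalization', 'Python2')]
```

 Both results agree with the recommendations described for FIG. 2G and FIG. 1G. The sketch compares algorithm labels only; the library check for "unknown" processes is assumed to have been done when the group was selected.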
 このように、実施例に係る情報処理装置は、複数のプロセスをリコメンドすることで、データサイエンティストに複数の選択肢を提供することができる。 In this way, the information processing apparatus according to the embodiment can provide a plurality of options to the data scientist by recommending a plurality of processes.
 次に、実施例に係る情報処理装置の機能構成について説明する。図3は、実施例に係る情報処理装置の機能構成を示す図である。図3に示すように、実施例に係る情報処理装置10は、データフロー記憶部11と、グループ抽出部12と、グループ記憶部13と、頻出度計算部14と、データベース15とを有する。また、実施例に係る情報処理装置10は、作成フロー記憶部16と、作成メタ情報記憶部17と、検索部18と、表示部19とを有する。 Next, the functional configuration of the information processing device according to the embodiment will be described. FIG. 3 is a diagram showing a functional configuration of the information processing apparatus according to the embodiment. As shown in FIG. 3, the information processing apparatus 10 according to the embodiment includes a data flow storage unit 11, a group extraction unit 12, a group storage unit 13, a frequency calculation unit 14, and a database 15. Further, the information processing device 10 according to the embodiment includes a creation flow storage unit 16, a creation meta information storage unit 17, a search unit 18, and a display unit 19.
 データフロー記憶部11は、複数のデータフローのグラフ構造の情報を記憶する。情報処理装置10は、例えば、ユーザがマウスを用いて行った指示を受け付けてファイルからデータフローのグラフ構造の情報を読み出してデータフロー記憶部11に格納したり追加したりする。 The data flow storage unit 11 stores information on the graph structure of a plurality of data flows. For example, the information processing device 10 receives an instruction given by the user using the mouse, reads out information on the graph structure of the data flow from the file, and stores or adds it to the data flow storage unit 11.
 図4は、データフロー記憶部11の一例を示す図である。図4に示すように、データフロー記憶部11は、データフローを識別するデータフロー名とデータフローのグラフ構造の情報を対応付けて記憶する。データフロー記憶部11は、例えば、データフローAについて、「Data1.csv→Python1→Data2.csv」、「Data2.csv→Python2→Data3.csv」を記憶する。また、データフロー記憶部11は、データフローAについて、「Data3.csv→Python3→Data4.csv」、「Data4.csv→Python4→Data5.csv」を記憶する。 FIG. 4 is a diagram showing an example of the data flow storage unit 11. As shown in FIG. 4, the data flow storage unit 11 stores the data flow name that identifies the data flow and the information of the graph structure of the data flow in association with each other. The data flow storage unit 11 stores, for example, "Data1.csv-> Python1-> Data2.csv" and "Data2.csv-> Python2-> Data3.csv" for the data flow A. Further, the data flow storage unit 11 stores "Data3.csv-> Python3-> Data4.csv" and "Data4.csv-> Python4-> Data5.csv" for the data flow A.
 The group extraction unit 12 extracts all groups using the information stored in the data flow storage unit 11, identifies metadata for each group, and stores the metadata in the group storage unit 13. Metadata other than statistical differences and algorithms includes the description attached to the data flow from which the group was extracted, the file names of data and processes, the property information of data and processes, the column names of input and output files, and process IDs. Alternatively, the group extraction unit 12 may accept a description of a group from the user and attach it as metadata. The group extraction unit 12 may identify multiple pieces of metadata for each group.
 グループ記憶部13は、グループのメタデータを記憶する。図5は、グループ記憶部13の一例を示す図である。図5は、メタデータが統計的な差異とアルゴリズムである場合を示す。図5に示すように、グループ記憶部13は、グループを識別するグループNo.に対応付けて、アルゴリズムと統計的な差異とを記憶する。例えば、グループ記憶部13は、グループA1について、アルゴリズムとして「削除→補間」を記憶し、統計的な差異として「行数の減少→値の数の増加」を記憶する。 The group storage unit 13 stores the group metadata. FIG. 5 is a diagram showing an example of the group storage unit 13. FIG. 5 shows the case where the metadata is statistical differences and algorithms. As shown in FIG. 5, the group storage unit 13 stores the algorithm and the statistical difference in association with the group No. that identifies the group. For example, the group storage unit 13 stores "deletion-> interpolation" as an algorithm for group A1 and "decrease in the number of rows-> increase in the number of values" as a statistical difference.
 頻出度計算部14は、グループの頻出度を計算し、グループの情報に対応付けてデータベース15に格納する。頻出度計算部14は、2つのグループでメタデータが類似すると、頻出度に1を加える。例えば、頻出度計算部14は、メタデータごとに類似度を定義して、類似度が所定の閾値以上の場合に、メタデータが類似すると判定する。複数のメタデータを用いる場合には、頻出度計算部14は、例えば、1つのメタデータが類似するごとに頻出度に1を加える。 The frequency calculation unit 14 calculates the frequency of the group and stores it in the database 15 in association with the group information. The frequency calculation unit 14 adds 1 to the frequency when the metadata is similar in the two groups. For example, the frequency calculation unit 14 defines the similarity for each metadata, and determines that the metadata is similar when the similarity is equal to or greater than a predetermined threshold value. When a plurality of metadata are used, the frequency calculation unit 14 adds 1 to the frequency every time one metadata is similar, for example.
 例えば、メタデータが統計的な差異とアルゴリズムである場合、頻出度計算部14は、アルゴリズムと統計的な差異が同じ場合に、頻出度に1を加える。なお、アルゴリズムが「不明」である場合には、頻出度計算部14は、対応するプロセスを実現するプログラムからライブラリ名を取得し、2つのグループで、ライブラリ名が一致する割合が0.8を超えていれば、「不明」に関してアルゴリズムが一致するとする。ここで、0.8は、閾値の例であり、他の値でもよい。 For example, when the metadata is a statistical difference and an algorithm, the frequency calculation unit 14 adds 1 to the frequency when the algorithm and the statistical difference are the same. When the algorithm is "unknown", the frequency calculation unit 14 obtains the library name from the program that realizes the corresponding process, and the ratio of matching library names in the two groups is 0.8. If it exceeds, the algorithm matches for "unknown". Here, 0.8 is an example of the threshold value, and other values may be used.
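 The matching rule described above can be sketched as follows. The text does not define how the library name match ratio is computed, so this sketch assumes it is the number of shared names divided by the number of distinct names across both programs; the dictionary layout and the function names are likewise hypothetical.

```python
UNKNOWN = "unknown"


def library_match_ratio(libs_a, libs_b):
    """Assumed definition: shared library names over all distinct library names."""
    union = libs_a | libs_b
    # Two programs that import nothing give no evidence of similarity.
    return len(libs_a & libs_b) / len(union) if union else 0.0


def groups_match(a, b, threshold=0.8):
    """Decide whether two groups count as the same for the frequency calculation."""
    if len(a["algorithms"]) != len(b["algorithms"]):
        return False  # different numbers of processes
    if a["differences"] != b["differences"]:
        return False
    for i, (alg_a, alg_b) in enumerate(zip(a["algorithms"], b["algorithms"])):
        if alg_a != alg_b:
            return False
        if alg_a == UNKNOWN:
            ratio = library_match_ratio(a["libraries"][i], b["libraries"][i])
            if ratio <= threshold:  # the text requires the ratio to exceed the threshold
                return False
    return True


differences = [
    "increase in the number of values",
    "change of value range",
    "calculation of a new column",
]
algorithms = ["interpolation", "normalization", UNKNOWN]

# Library sets are invented for illustration; index 2 is the "unknown" process.
a5 = {"algorithms": algorithms, "differences": differences,
      "libraries": {2: {"numpy", "pandas", "scipy"}}}
d2 = {"algorithms": algorithms, "differences": differences,
      "libraries": {2: {"numpy", "pandas", "scipy"}}}
other = {"algorithms": algorithms, "differences": differences,
         "libraries": {2: {"numpy", "requests"}}}

print(groups_match(a5, d2))     # True
print(groups_match(a5, other))  # False (ratio 1/4)
```

 Since 0.8 is given only as an example threshold, it is left as a parameter.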
 データベース15は、グループの情報と頻出度を対応付けて記憶し、データフロー作成の際に参照される。図6は、データベース15の一例を示す図である。図6は、メタデータが統計的な差異とアルゴリズムである場合を示す。図6に示すように、データベース15は、アルゴリズムと、統計的な差異と、プロセス数と、頻出度と、グループ名と、プログラム名を対応付けて記憶する。 The database 15 stores group information in association with the frequency of occurrence and is referred to when creating a data flow. FIG. 6 is a diagram showing an example of the database 15. FIG. 6 shows the case where the metadata is statistical differences and algorithms. As shown in FIG. 6, the database 15 stores the algorithm, the statistical difference, the number of processes, the frequency of occurrence, the group name, and the program name in association with each other.
 The number of processes is the number of processes included in the group. The group name identifies a group specified by an algorithm and a statistical difference. When multiple groups have the same algorithm and statistical difference, multiple group names are recorded. The program name is associated with a group name and indicates the programs that implement the processes included in the group identified by that group name.
 For example, groups A1 and B1, which are specified by "decrease in the number of rows → increase in the number of values" and "deletion → interpolation", have "2" processes and a frequency of "1". The processing of group A1 is realized by executing "Python1" and "Python2" of data flow A in the order "Python1 → Python2".
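 One way to picture a row of the database 15 is a record keyed by the algorithm sequence and the statistical difference sequence, whose frequency grows by 1 for each matching pair of groups. The following is an illustrative sketch only; `GroupRecord` and `register_match` are hypothetical names, and the in-memory dictionary stands in for whatever storage the embodiment actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class GroupRecord:
    algorithms: tuple
    differences: tuple
    frequency: int = 0
    programs: dict = field(default_factory=dict)  # group name -> program sequence

    @property
    def process_count(self):
        return len(self.algorithms)


def register_match(db, algorithms, differences, group_a, group_b):
    """Record that two groups matched: add 1 to the frequency and keep their programs."""
    key = (tuple(algorithms), tuple(differences))
    record = db.setdefault(key, GroupRecord(*key))
    record.frequency += 1
    for name, programs in (group_a, group_b):
        record.programs[name] = tuple(programs)
    return record


db = {}
algorithms = ["deletion", "interpolation"]
differences = ["decrease in the number of rows", "increase in the number of values"]
programs = ["Python1", "Python2"]

# Second example: AA1, B1 and DD1 match pairwise, giving three matching pairs.
register_match(db, algorithms, differences, ("AA1", programs), ("B1", programs))
register_match(db, algorithms, differences, ("AA1", programs), ("DD1", programs))
record = register_match(db, algorithms, differences, ("B1", programs), ("DD1", programs))

print(record.process_count, record.frequency, sorted(record.programs))
# 2 3 ['AA1', 'B1', 'DD1']
```

 Counting one increment per matching pair reproduces the frequency of 3 reported for groups AA1, B1, and DD1 in the second example, and a frequency of 1 when only two groups match, as with A1 and B1.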
 作成フロー記憶部16は、ユーザが作成中のデータフローのグラフ構造の情報を記憶する。情報処理装置10は、例えば、ユーザがマウスやキーボードを用いて作成中のデータフローのグラフ構造の情報を作成フロー記憶部16に格納する。 The creation flow storage unit 16 stores information on the graph structure of the data flow being created by the user. The information processing device 10 stores, for example, information on the graph structure of the data flow being created by the user using a mouse or keyboard in the creation flow storage unit 16.
 FIG. 7 is a diagram showing an example of the creation flow storage unit 16. As shown in FIG. 7, the creation flow storage unit 16 stores No., a number that identifies an element of the graph structure of the data flow being created, in association with the graph structure of that element. Here, an element is the graph structure consisting of one process and its input data and output data. For example, the graph structure of the element whose number is "1" is "Data1.csv → Python1 → Data2.csv".
 グループ抽出部12は、作成フロー記憶部16が記憶する情報を用いて作成中のデータフローのメタデータを特定して、作成メタ情報記憶部17に格納する。作成メタ情報記憶部17は、作成中のデータフローのメタデータを記憶する。 The group extraction unit 12 identifies the metadata of the data flow being created using the information stored in the creation flow storage unit 16 and stores it in the creation meta information storage unit 17. The creation meta information storage unit 17 stores the metadata of the data flow being created.
 検索部18は、作成メタ情報記憶部17が記憶するメタデータを用いて、作成中のデータフローに最も類似するグループをデータベース15から検索する。なお、情報処理装置10は、作成中のデータフローに最も類似するグループの代わりに、類似するグループを検索してもよい。 The search unit 18 searches the database 15 for the group most similar to the data flow being created, using the metadata stored in the created meta information storage unit 17. The information processing device 10 may search for a similar group instead of the group most similar to the data flow being created.
For example, when the metadata consists of statistical differences and algorithms, the search unit 18 searches the database 15 for the largest group that satisfies the following conditions and treats it as the most similar group.
- It has more processes than the data flow being created.
- Its frequency is equal to or greater than a threshold (for example, 1).
- It includes the statistical differences and algorithms stored in the creation meta information storage unit 17.
- For processes whose algorithm is "unknown", the match ratio of the names of the libraries imported by the corresponding Python programs exceeds 0.8.
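The first three conditions can be sketched as a filter followed by a "largest group" selection. This is an illustrative sketch only: `Group`, `contains_in_order`, and `find_most_similar` are hypothetical names, "includes" is assumed to mean that the created flow's (statistical difference, algorithm) pairs appear in the group in the same order, and the library-name condition for "unknown" algorithms is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# A step is a (statistical difference, algorithm) pair.
Step = Tuple[str, str]


@dataclass
class Group:
    name: str
    steps: List[Step]
    frequency: int


def contains_in_order(candidate: List[Step], wanted: List[Step]) -> bool:
    """True if every step of `wanted` appears in `candidate` in the same order."""
    it = iter(candidate)
    return all(step in it for step in wanted)


def find_most_similar(groups: List[Group], created: List[Step],
                      threshold: int = 1) -> Optional[Group]:
    """Return the largest group meeting the conditions, or None if none does."""
    eligible = [
        g for g in groups
        if len(g.steps) > len(created)          # more processes than the created flow
        and g.frequency >= threshold            # frequency at or above the threshold
        and contains_in_order(g.steps, created) # includes the created flow's metadata
    ]
    return max(eligible, key=lambda g: len(g.steps), default=None)
```

Relaxing `threshold` or the inclusion test is one way to obtain "similar" rather than "most similar" groups.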
The search unit 18 may also identify similar groups by lowering the required library-name match ratio below 0.8. Alternatively, the search unit 18 may treat as a similar group a group that includes the statistical differences and algorithms of the data flow being created except for one algorithm. The search unit 18 then identifies each process that is absent from the data flow being created and the position in that data flow where the process belongs.
The display unit 19 outputs, as recommendation information, the program that realizes the process and its input/output data at the position identified by the search unit 18, and displays it on a display device (not shown). The information processing device 10 may also output the recommendation information to a printer via a printer output unit. FIGS. 1G and 2G show display examples produced by the display unit 19.
Next, the flow of processing by the information processing device 10 will be described. FIG. 8 is a flowchart showing this flow. In FIG. 8, steps S1 to S5 create the database 15, and steps S6 to S9 extract the additional process to be recommended.
As shown in FIG. 8, the information processing device 10 groups consecutive portions of two data flows (step S1). Here, a group contains two or more processes together with the data from the input data of the first of those processes to the output data of the last. The last process may have no output data.
The information processing device 10 then identifies the metadata of the two groups (step S2). If the metadata of the two groups are similar, the information processing device 10 increments the similarity of the group by 1 (step S3).
If the group is not yet registered in the database 15, the information processing device 10 registers its metadata and frequency in the database 15 (step S4). The information processing device 10 then determines whether the frequency has been obtained for every combination of data flows and groupings (step S5) and, if any combination remains for which the similarity has not been obtained, returns to step S1.
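Steps S1 to S5 can be sketched as a loop over pairs of data flows. The sketch below rests on several assumptions that the text does not settle: the counter incremented in step S3 is taken to be the frequency registered in step S4, two groups are treated as "similar" only when their metadata are identical, and every group is registered with an initial frequency of 0. The names `contiguous_groups` and `build_database` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Step = Tuple[str, str]          # (statistical difference, algorithm)
GroupKey = Tuple[Step, ...]     # metadata of one group, in execution order


def contiguous_groups(flow: List[Step]) -> List[GroupKey]:
    """Step S1: every run of two or more consecutive processes in a flow."""
    n = len(flow)
    return [tuple(flow[i:j]) for i in range(n) for j in range(i + 2, n + 1)]


def build_database(flows: Dict[str, List[Step]]) -> Counter:
    """Steps S1-S5: count, per group metadata, how often it recurs across flows."""
    database: Counter = Counter()
    for (_, flow_a), (_, flow_b) in combinations(flows.items(), 2):
        groups_a = contiguous_groups(flow_a)
        groups_b = contiguous_groups(flow_b)
        for key in groups_a + groups_b:
            database.setdefault(key, 0)     # S4: register if not yet registered
        for ga in groups_a:                 # S2-S3: compare metadata, count matches
            for gb in groups_b:
                if ga == gb:                # assumption: "similar" means identical
                    database[ga] += 1
    return database
```

With two flows that share the two-process group "deletion → interpolation", this sketch gives that group a frequency of 1, which agrees with the example of groups A1 and B1.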
On the other hand, once the frequency has been obtained for every combination of data flows and groupings, the information processing device 10 identifies the metadata of the data flow being created (step S6). It then extracts from the database 15 the groups whose frequency is equal to or greater than a predetermined threshold and, from those, selects the groups that have at least one more process than the data flow being created (step S7).
From the selected groups, the information processing device 10 identifies the group whose metadata is most similar to that of the data flow being created (step S8). From the identified group, it then identifies each process absent from the data flow being created and the position in that data flow where the process belongs, and displays the program that realizes the process and its input/output data at that position (step S9).
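The identification in step S9 of the missing processes and their positions can be sketched as a single pass over the selected group. This is an assumption-laden illustration: `missing_processes` is a hypothetical name, and the created flow is assumed to appear within the group in the same order.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (statistical difference, algorithm)


def missing_processes(created: List[Step], group: List[Step]) -> List[Tuple[int, Step]]:
    """Return (position, step) pairs: steps of `group` absent from `created`,
    with the index in `created` before which each step would be inserted."""
    missing = []
    pos = 0  # index of the next step of `created` still to be matched
    for step in group:
        if pos < len(created) and created[pos] == step:
            pos += 1                      # step already present in the created flow
        else:
            missing.append((pos, step))   # step to recommend at this position
    return missing
```

A position equal to `len(created)` means the recommended process would be appended after the last process of the flow being created.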
As described above, in the embodiment, the database 15 stores information on groups. The search unit 18 searches the database 15 for the group most similar to the data flow to be created. The display unit 19 then extracts, from the group found by the search unit 18, the processes that differ from the data flow to be created and displays them. The information processing device 10 can therefore support the user in creating a data flow.
Further, in the embodiment, the database 15 stores, for each group, the metadata identified from that group. The search unit 18 searches the database 15 for a group that has more processes than the data flow to be created and whose metadata is most similar to that of the data flow to be created. The information processing device 10 can therefore appropriately find a group that serves as a reference for creating a data flow.
Further, in the embodiment, the frequency calculation unit 14 calculates the frequency of a group based on whether it is similar to groups of other data flows, and the database 15 stores the frequency in association with the group. The search unit 18 searches the database 15 for groups whose frequency is equal to or greater than a predetermined threshold. The information processing device 10 can therefore find frequently used groups as references for creating a data flow.
Further, in the embodiment, the database 15 stores statistical differences and algorithms as metadata. As the most similar group, the search unit 18 searches the database 15 for the group that has more processes than the data flow to be created, includes the statistical differences and algorithms of the data flow to be created, and has the largest number of processes. The information processing device 10 can therefore appropriately find the most similar group.
Further, in the embodiment, when the algorithm of a process is "unknown", the search unit 18 determines that the algorithms match if the match ratio of the libraries imported by the programs that realize the process exceeds 0.8. The search unit 18 can therefore determine whether algorithms match even when the algorithm is unknown.
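The library-name comparison for "unknown" algorithms can be sketched with Python's standard `ast` module. The text does not define how the match ratio is computed, so the Jaccard index used here is an assumption, and `imported_libraries`, `match_ratio`, and `algorithms_match` are hypothetical names.

```python
import ast
from typing import Set


def imported_libraries(source: str) -> Set[str]:
    """Collect the top-level package names imported by a Python program."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) are skipped: they are not libraries.
            names.add(node.module.split(".")[0])
    return names


def match_ratio(a: Set[str], b: Set[str]) -> float:
    """Assumed definition: shared names divided by all names (Jaccard index)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def algorithms_match(src_a: str, src_b: str, threshold: float = 0.8) -> bool:
    """Treat two 'unknown' algorithms as matching if the ratio exceeds the threshold."""
    return match_ratio(imported_libraries(src_a), imported_libraries(src_b)) > threshold
```

Lowering `threshold` below 0.8 corresponds to the relaxation the description mentions for finding similar rather than most similar groups.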
Although the embodiment has described the information processing device 10, a data flow creation program with the same functions can be obtained by implementing the configuration of the information processing device 10 in software. A computer that executes the data flow creation program is therefore described below.
FIG. 9 is a diagram showing the hardware configuration of a computer that executes the data flow creation program according to the embodiment. As shown in FIG. 9, the computer 50 has a main memory 51, a CPU (Central Processing Unit) 52, a LAN (Local Area Network) interface 53, and an HDD (Hard Disk Drive) 54. The computer 50 also has a super IO (Input Output) 55, a DVI (Digital Visual Interface) 56, and an ODD (Optical Disk Drive) 57.
The main memory 51 is a memory that stores programs, intermediate results of program execution, and the like. The CPU 52 is a central processing unit that reads a program from the main memory 51 and executes it. The CPU 52 includes a chipset having a memory controller.
The LAN interface 53 is an interface for connecting the computer 50 to other computers via a LAN. The HDD 54 is a disk device that stores programs and data, and the super IO 55 is an interface for connecting input devices such as a mouse and a keyboard. The DVI 56 is an interface for connecting a liquid crystal display device, and the ODD 57 is a device that reads and writes DVDs and CD-Rs.
The LAN interface 53 is connected to the CPU 52 by PCI Express (PCIe), and the HDD 54 and the ODD 57 are connected to the CPU 52 by SATA (Serial Advanced Technology Attachment). The super IO 55 is connected to the CPU 52 by LPC (Low Pin Count).
The data processing program executed on the computer 50 is stored on a CD-R, which is an example of a recording medium readable by the computer 50, read from the CD-R by the ODD 57, and installed on the computer 50. Alternatively, the data processing program is stored in a database or the like of another computer system connected via the LAN interface 53, read from that database, and installed on the computer 50. The installed data processing program is stored on the HDD 54, read into the main memory 51, and executed by the CPU 52.
10 Information processing device
11 Data flow storage unit
12 Group extraction unit
13 Group storage unit
14 Frequency calculation unit
15 Database
16 Creation flow storage unit
17 Creation meta information storage unit
18 Search unit
19 Display unit
50 Computer
51 Main memory
52 CPU
53 LAN interface
54 HDD
55 Super IO
56 DVI
57 ODD

Claims (6)

  1.  An information processing device comprising:
      a database that accumulates a series of data flows, each including, as elements, processes, data used by the processes, and data obtained as results of the processes;
      an extraction unit that extracts, from the database, a data flow similar to a data flow to be created; and
      an output unit that extracts, from the data flow extracted by the extraction unit, an element that differs from the data flow to be created and outputs the extracted element.
  2.  The information processing device according to claim 1, wherein
      the database stores, for each data flow, metadata identified from that data flow,
      the extraction unit extracts from the database, as the data flow similar to the data flow to be created, a data flow that has more processes than the data flow to be created and whose metadata is similar to that of the data flow to be created, and
      the output unit extracts, from the data flow extracted by the extraction unit, a process not included among the processes of the data flow to be created and outputs the extracted process.
  3.  The information processing device according to claim 1 or 2, further comprising
      a calculation unit that calculates a frequency based on whether a data flow is similar to other data flows, wherein
      the database stores the frequency in association with the data flow, and
      the extraction unit extracts from the database a data flow whose frequency is equal to or greater than a first threshold.
  4.  The information processing device according to claim 2, wherein
      the database stores statistical differences and algorithms as the metadata of a data flow, and
      the extraction unit extracts from the database, as the data flow similar to the data flow to be created, the data flow that has more processes than the data flow to be created, includes the statistical differences and algorithms of the data flow to be created, and has the largest number of processes.
  5.  The information processing device according to claim 4, wherein
      the database stores a program name identifying a program that realizes a process, and
      when there is a process whose algorithm is unknown, the extraction unit uses the program name to identify the names of the libraries imported by the program and determines that the algorithms match when the ratio of matching names exceeds a second threshold.
  6.  A data flow creation program that causes a computer to perform operations comprising:
      extracting a data flow similar to a data flow to be created from a database that accumulates a series of data flows, each including, as elements, processes, data used by the processes, and data obtained as results of the processes; and
      extracting, from the extracted data flow, an element that differs from the data flow to be created and outputting the extracted element.
PCT/JP2019/034153 2019-08-30 2019-08-30 Information processing device, and data flow creating program WO2021038835A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/034153 WO2021038835A1 (en) 2019-08-30 2019-08-30 Information processing device, and data flow creating program

Publications (1)

Publication Number Publication Date
WO2021038835A1 (en) 2021-03-04

Family

ID=74683408

Country Status (1)

Country Link
WO (1) WO2021038835A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003044480A (en) * 2002-04-30 2003-02-14 Hitachi Software Eng Co Ltd Method for extracting specified article data
JP2010237960A (en) * 2009-03-31 2010-10-21 Nec Corp Inquiry reply support device, inquiry reply support system and method, and reply support program
WO2018011895A1 (en) * 2016-07-12 2018-01-18 株式会社日立製作所 Data processing flow management system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHINYAMA, YUSUKE ET AL.: "Implementation of a Data Flow Graph Extractor for Pattern Matching", IEICE TECHNICAL REPORT, vol. 117, no. 249, 12 October 2017 (2017-10-12), pages 37 - 42, ISSN: 0913-5685 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19942725

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19942725

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP