WO2024093253A1 - Data sampling method and related device - Google Patents

Data sampling method and related device

Publication number
WO2024093253A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
column
attribute
sample set
sample
Prior art date
Application number
PCT/CN2023/100937
Other languages
French (fr)
Chinese (zh)
Inventor
陈肇强
魏子恒
王浩宇
宋韶旭
Original Assignee
华为云计算技术有限公司
清华大学
Application filed by 华为云计算技术有限公司, 清华大学
Publication of WO2024093253A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9035Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present application relates to the field of data processing technology, and in particular to a data sampling method, system, computing device cluster, computer-readable storage medium, and computer program product.
  • Data preparation refers to the process of preprocessing raw data into data suitable for subsequent processing (such as mining and analysis).
  • Data preview is one of the important functional features of data preparation. Data preview provides users with a tool to quickly understand the distribution of data. Users can select the next data processing or analysis operation by previewing data. For example, if a user finds through data preview that there are many duplicate data entries, they can submit a task to deduplicate the data set.
  • the typical data preview process is to randomly sample the dataset including the original data to obtain sample data, or select the first n rows as sample data, and then display the sample data.
  • the sample data selected by the above methods is not representative enough, especially in small sample scenarios. Sample data obtained by random sampling or by selecting the first n rows can hardly represent the overall data distribution and provides little help for subsequent data processing.
  • the present application provides a data sampling method, which can obtain sample data close to the global data distribution, thereby greatly improving the representativeness of the sample data.
  • the present application also provides a data sampling system, a computing device cluster, a computer-readable storage medium, and a computer program product corresponding to the method.
  • the present application provides a data sampling method.
  • the method may be performed by a data sampling system.
  • the data sampling system may be a software system, which is deployed in a computing device cluster.
  • the computing device cluster executes the program code of the data sampling system, thereby performing the data sampling method of the present application.
  • the data sampling system may also be a hardware system, such as a computing device cluster with a data sampling function.
  • the data sampling system can obtain a data set, and then determine the number of attribute columns and the data types of attribute values in the data set. Then, the data sampling system can sample from the data set to obtain a sample set based on the number of attribute columns and the data types of attribute values in the data set.
  • This method performs data sampling based on the number of attribute columns and the data type of attribute values in a large amount of original data, thereby obtaining sample data that is close to the global data distribution. This can improve the representativeness of the sample data and make the sample data more suitable for data preview scenarios, making it easier for users to perform subsequent data processing based on the sample data.
  • the data sampling system may also present a sample set to a user, or perform data analysis through an artificial intelligence (AI) algorithm based on the sample set.
  • the sample set obtained by sampling this method can be used in multiple scenarios such as data preview or data processing. Users can understand the data distribution of the data set through data preview, or determine subsequent processing operations on the data set through the sample set.
  • the data sampling system may obtain a sample set based on the proportions of multiple attribute values corresponding to the attribute columns in the data set that appear in the data set.
  • this method can perform data sampling based on the proportion of attribute values to improve the representativeness of the sample data.
  • the data set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the data set.
  • the data sampling system can obtain a first occurrence number, that is, the number of times the i-th attribute value among the multiple attribute values appears in the data set, and then determine the number of times the i-th attribute value appears in the sample set based on the product of the first occurrence number and the ratio of the sample set size to the data set size (m/n).
  • the sample set obtained by this method can represent the distribution of multiple attribute values in the data set, making it easier for users to understand the global data distribution in the data set.
  • when the product is an integer, the data sampling system may determine the number of times the i-th attribute value appears in the sample set to be that integer.
  • when the product is not an integer, the data sampling system may round the product up or down, choosing between the two based on the distance between the sample set and the data set that results from each rounding, and use the resulting integer as the number of times the i-th attribute value appears in the sample set.
  • this method uses the idea of KL divergence to evaluate the distance of the sample set after rounding up and rounding down, thereby determining the number of times the attribute value appears in the sample set. In this way, the degree of difference between the data distribution in the sample set and the data distribution in the initial sample set can be reduced.
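The proportional allocation described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `proportional_counts`, the helper `kl_term`, and the per-value rounding rule (pick the rounding whose single KL-style term is smaller in magnitude) are assumptions, and the ratio is assumed to be the sample set size over the data set size (m/n).

```python
import math
from collections import Counter


def proportional_counts(values, m):
    """Return, per attribute value, how often it should appear in a sample of m rows.

    Hypothetical sketch: each count is (occurrences in data set) * m / n,
    rounded up or down using a KL-divergence-style term as the tie-breaker.
    """
    n = len(values)
    result = {}
    for value, occurrences in Counter(values).items():
        p = occurrences / n              # share of the value in the data set
        product = occurrences * m / n    # ideal, possibly fractional, count
        lower, upper = math.floor(product), math.ceil(product)
        if lower == upper:               # product is already an integer
            result[value] = lower
            continue

        def kl_term(k):
            # Contribution p * log(p / q) with q = k / m; k == 0 would drop
            # the value from the sample entirely, so treat it as infinitely far.
            if k == 0:
                return math.inf
            return abs(p * math.log(p / (k / m)))

        result[value] = lower if kl_term(lower) <= kl_term(upper) else upper
    return result
```

Because each value is rounded independently in this sketch, the counts may not add up to exactly m; a fuller implementation would need to reconcile the total.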
  • the data sampling system may sort the multiple attribute values corresponding to the attribute column, and then select data from the sorted sequence according to the target interval to obtain a sample set.
  • this method selects a sample set from the sequence of sorted attribute values at a target interval, so that the sample set can reflect the data distribution in the data set and ensure that the data in the sample set is representative.
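Selecting from a sorted sequence at a target interval might look like the sketch below. The name `interval_sample` is hypothetical, and the target interval is assumed here to be n/m (data set size over sample size); the application does not state how the interval is chosen.

```python
def interval_sample(values, m):
    """Pick m values from the sorted sequence at a fixed interval.

    Hypothetical sketch: the interval is assumed to be len(values) / m.
    """
    if not 0 < m <= len(values):
        raise ValueError("m must be between 1 and the number of values")
    ordered = sorted(values)
    interval = len(ordered) / m
    # int(i * interval) is always a valid index because i < m.
    return [ordered[int(i * interval)] for i in range(m)]
```

Since the picks are spread evenly over the sorted order, dense regions of the value range contribute more samples than sparse ones, which is how the selection can track the distribution.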
  • the data sampling system may select the first column sample set of each attribute column according to the data type of the attribute value corresponding to each attribute column in the multiple attribute columns, and then select m rows of data from the data set to obtain the initial row sample data.
  • the attribute values corresponding to each attribute column in the initial row sample data form the second column sample set of each attribute column.
  • the data sampling system may determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data, and determine whether to perform a replacement operation on the initial row sample data based on the distance to obtain a sample set.
  • this method performs data sampling based on a greedy algorithm.
  • the target row replaces the initial row sample data, so that the data distribution of the sample set obtained by sampling is closer to the data distribution in the data set, thereby improving the representativeness of the sample data.
  • the data sampling system may refuse to perform an operation of replacing a row of data in the initial row sample data with a target row.
  • this method does not perform a replacement operation when the sample set after the replacement operation cannot better reflect the data distribution in the initial sample set, thereby ensuring that the sample data can better reflect the data distribution in the data set.
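One greedy replacement step could be sketched as below. The names `total_distance` and `try_replace` are hypothetical, the distance function is passed in as a parameter, and trying every row position and keeping the best one is an assumption about how the replaced row is chosen.

```python
def total_distance(sample_rows, first_column_samples, distance):
    """Sum the per-column distance between the target column samples
    (first column sample sets) and the columns of the current row sample."""
    return sum(
        distance(first_column_samples[j], [row[j] for row in sample_rows])
        for j in range(len(first_column_samples))
    )


def try_replace(sample_rows, target_row, first_column_samples, distance):
    """Replace one row with target_row only if that lowers the total distance.

    Hypothetical sketch: returns the unchanged row sample when no
    replacement improves the distance (the replacement is refused).
    """
    best_rows = sample_rows
    best = total_distance(sample_rows, first_column_samples, distance)
    for i in range(len(sample_rows)):
        candidate = sample_rows[:i] + [target_row] + sample_rows[i + 1:]
        d = total_distance(candidate, first_column_samples, distance)
        if d < best:
            best, best_rows = d, candidate
    return best_rows
```

Calling `try_replace` once per remaining row of the data set would give a simple greedy pass over the data.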
  • the data sampling system can determine the difference in the number of occurrences of various attribute values of the i-th column in the first column sample set of the i-th column and the second column sample set after replacement, and then determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data based on the sum of the absolute values of the difference.
  • this method draws on the idea of KS statistics and determines the distance based on the difference in the number of attribute value occurrences before and after replacement to determine whether to perform the replacement operation, thereby sampling to obtain a sample set, which can improve the representativeness of the sample data.
  • the data sampling system may sort the first column sample set of the i-th column and the replaced second column sample set in the same way, and determine the difference between the elements in the sorted first column sample set and the corresponding elements in the sorted second column sample set. Then, the data sampling system may determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data based on the sum of the absolute values of the differences.
  • this method draws on the idea of KS statistics and determines the distance based on the difference between the elements before and after replacement to determine whether to perform the replacement operation, thereby sampling to obtain a sample set, which can improve the representativeness of the sample data.
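The two distances described above, one for discrete and one for continuous attribute values, can be written as a short sketch. The function names are hypothetical, and the continuous case assumes both column sample sets have the same length so that elements can be paired after sorting.

```python
from collections import Counter


def discrete_distance(first, second):
    """Sum of absolute differences in occurrence counts of each attribute value."""
    a, b = Counter(first), Counter(second)
    return sum(abs(a[v] - b[v]) for v in a.keys() | b.keys())


def continuous_distance(first, second):
    """Sum of absolute differences between corresponding elements after
    sorting both column sample sets in the same (ascending) order."""
    if len(first) != len(second):
        raise ValueError("column sample sets must have the same length")
    return sum(abs(x - y) for x, y in zip(sorted(first), sorted(second)))
```

A distance of zero in either function means the two column sample sets hold the same values with the same multiplicities.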
  • the data sampling system can perform two-stage sampling on the data set.
  • the first stage sampling can adopt random sampling, and the second stage sampling can be performed according to the number of attribute columns in the data set and the data types of the attribute values.
  • the data size of the original data set can be effectively reduced, so that devices with limited computing resources can also perform subsequent second-stage sampling on the data set, thereby realizing the function of data preview in the big data scenario.
  • the method can be used for data preparation, so that the user can determine the processing operation on the data set based on the sample set. Since the sample set is close to the global data distribution in the data set, the method can improve the user's data preparation efficiency.
  • the method can be used for selecting machine learning training data. Since users can understand the data distribution in the data set through the sample set, the method can enable users to select representative data for data annotation.
  • the present application provides a data sampling system.
  • the system comprises:
  • the acquisition module is used to obtain the data set
  • a determination module used to determine the number of attribute columns and the data types of attribute values in the data set
  • the sampling module is used to obtain a sample set by sampling from the data set according to the number of the attribute columns and the data type of the attribute value.
  • system further includes:
  • an interaction module configured to present the sample set to a user
  • the data analysis module is used to perform data analysis based on the sample set through an artificial intelligence (AI) algorithm.
  • the sampling module is specifically used to:
  • a sample set is obtained by sampling according to the proportions of the multiple attribute values corresponding to the attribute columns in the data set appearing in the data set.
  • the data set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the data set
  • the sampling module is specifically used to:
  • the number of times the i-th attribute value appears in the sample set is determined according to the product of the first occurrence number and the ratio of the size of the sample set to the size of the data set.
  • the sampling module is specifically used to:
  • the integer after the product is rounded up or rounded down is determined as the number of times the i-th attribute value appears in the sample set.
  • the sampling module is specifically used to:
  • data is selected according to the target interval to obtain a sample set.
  • the sampling module is specifically used to:
  • the sampling module is specifically used to:
  • the sampling module is specifically used to:
  • the at least one attribute column includes an i-th column, and when the attribute value corresponding to the i-th column is discrete, determining the difference in the number of occurrences of various attribute values of the i-th column in a first column sample set and a second column sample set after replacement of the i-th column;
  • a distance between a first column sample set and a second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data is determined.
  • the sampling module is specifically used to:
  • the at least one attribute column includes an i-th column, and when the attribute value corresponding to the i-th column is continuous, the first column sample set and the replaced second column sample set of the i-th column are sorted in the same way;
  • a distance between a first column sample set and a second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data is determined.
  • the present application provides a computing device cluster.
  • the computing device cluster includes at least one computing device, and the at least one computing device includes at least one processor and at least one memory.
  • the at least one processor and the at least one memory communicate with each other.
  • the at least one processor is used to execute instructions stored in the at least one memory, so that the computing device or the computing device cluster performs the data sampling method described in the first aspect or any implementation of the first aspect.
  • the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores instructions, wherein the instructions instruct a computing device or a computing device cluster to execute the data sampling method described in the first aspect or any one of the implementations of the first aspect.
  • the present application provides a computer program product comprising instructions, which, when executed on a computing device or a computing device cluster, enables the computing device or the computing device cluster to execute the data sampling method described in the first aspect or any one of the implementations of the first aspect.
  • FIG. 1 is a schematic diagram of a method for acquiring sample data provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a method for obtaining sample data provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the architecture of a data sampling system provided in an embodiment of the present application.
  • FIGS. 4A to 4C are schematic diagrams of a sample set display interface provided in an embodiment of the present application.
  • FIG. 5 is a flow chart of a data sampling method provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a flow chart of a two-stage data sampling method provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the structure of a data sampling system provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application.
  • first and second in the embodiments of the present application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, the features defined as “first” and “second” may explicitly or implicitly include one or more of the features.
  • Data preparation refers to the process of preprocessing raw data to make it more suitable for further processing and analysis.
  • raw data that has been processed by data preparation can be used as more ideal known data to input into machine learning (ML) algorithms to improve the effectiveness of machine learning algorithms.
  • the process of data preparation includes multiple steps such as data collection, data preview, data cleaning, and data labeling.
  • data preview is one of the important functional features of data preparation.
  • Data preview refers to the process of presenting a small part of sample data (also called preview data) from a large amount of original data to the user, so that the user can quickly understand the distribution of the original data. Users can select the next data processing or analysis operation based on the preview data. For example, if the user finds through data preview that there are many duplicate data entries, they can submit a task to deduplicate the original data. For another example, if the user finds through data preview that many data entries have missing values, they can submit a task to fill the missing values of the original data.
  • Data preview is usually achieved through data sampling technology.
  • Data sampling refers to the process of selecting a subset (also called a sample) from the original data set to estimate the characteristics of the whole.
  • the process of data preview can include importing original data, obtaining sample data and data preview.
  • the user can import raw data from multiple data sources.
  • the data preview device uses data sampling technology to obtain sample data from the raw data and displays the sample data to the user.
  • There are multiple methods for obtaining sample data. Referring to the schematic diagram of a method for obtaining sample data shown in FIG. 1, the raw data includes multiple rows of data, and the data preview device can select the first n rows of data in the raw data as sample data.
  • Alternatively, the data preview device may randomly select n rows of data from the raw data as sample data.
  • the sample data obtained by the above method is not representative enough and it is difficult to characterize the data distribution of the original data as a whole.
  • When the data preview device displays such sample data to the user, it is difficult to provide assistance to the user for subsequent data processing, that is, it is difficult for the user to select appropriate data processing or analysis operations based on the sample data.
  • Common data sampling technologies may include data sampling methods based on filtering, data sampling methods based on abnormal situations, data sampling methods based on stratified sampling, and the like. For example, when the original data includes multiple rows of data with an attribute value of "year”, a data sampling method based on filtering can be used to obtain data with a year attribute value of "2022" as sample data. For another example, when the original data includes multiple rows of data with missing values, a data sampling method based on abnormal situations can be used to obtain data with missing values as sample data.
  • the above methods are mainly aimed at data analysis tasks, and the data sampling method used needs to be set according to the specific analysis task, which does not fully match the needs of data preview. At the same time, in small sample scenarios, it is difficult to obtain sample data close to the global data distribution using the above methods.
  • an embodiment of the present application provides a data sampling method, which can be performed by a data sampling system.
  • the data sampling system can obtain a data set, and then determine the number of attribute columns and the data type of attribute values in the data set. Then, the data sampling system can sample from the data set to obtain a sample set according to the number of attribute columns and the data type of attribute values in the data set.
  • This method performs data sampling based on the number of attribute columns and the data type of attribute values in a large amount of original data, thereby obtaining sample data that is close to the global data distribution. This can improve the representativeness of the sample data and make the sample data more suitable for data preview scenarios, making it easier for users to perform subsequent data processing based on the sample data.
  • the data sampling system 30 is connected to at least one data source 10 and interacts with a user 20.
  • the data sampling system 30 may also be connected to a data processing system 40 to provide the sampled data to the data processing system 40 for data analysis and other data processing.
  • the data sampling system 30 may be a software system, which may be deployed in a computing device cluster, and the computing device cluster executes the data sampling method by running the program code of the software system.
  • the data sampling system 30 may also be a hardware system, which may be a computer cluster with a data sampling function. When the hardware system is running, the data sampling method of the embodiment of the present application is implemented. The embodiment of the present application is illustrated by taking the data sampling system 30 as a software system.
  • the data source 10 may be software or hardware that provides a data set.
  • the data source 10 may be a search engine, a database, or other application that generates a large amount of data.
  • the data source 10 may also be hardware that can generate a data set or store a data set.
  • the data processing system 40 may be a software system that may be deployed in a computer cluster that implements data processing functions by executing the program code of the software system.
  • the data processing system 40 may also be a hardware system that implements data processing functions when the hardware system is running.
  • the data sampling system 30 includes a sample acquisition device 32 and a data preview device 34. Each device is introduced below.
  • the sample acquisition device 32 is used to acquire a data set from at least one data source 10 and sample from the data set to obtain a sample set.
  • the sample acquisition device 32 can acquire the sample set from the data set according to the number of attribute columns in the data set and the data type of the attribute value.
  • the number of the above-mentioned attribute columns may be 1 or greater than 1, and the data type of the attribute value may be discrete or continuous.
  • the sample acquisition device 32 may sample from the data set according to the proportion of the multiple attribute values corresponding to the above-mentioned attribute columns in the data set to obtain a sample set.
  • the sample acquisition device 32 may sort the multiple attribute values corresponding to the above-mentioned attribute columns, select data for sampling according to the target interval, and thus obtain a sample set.
  • the sample acquisition device 32 may determine whether to replace the initial row sample data according to the distance between the selected initial row sample data and the target row data, and thus obtain a sample set.
  • the data preview device 34 is used to present the sample set to the user 20 to realize the data preview function.
  • the data preview device 34 can provide an interactive interface, which can also be called a user interface (UI) interface.
  • the data preview device 34 can realize data preview through the UI interface.
  • the interactive interface can include a graphical user interface (GUI) or a command user interface (CUI).
  • the UI interface provided by the data preview device 34 may include a sample set display interface. Referring to the sample set display interface 400 shown in FIG. 4A to FIG. 4C, the data preview device 34 may present the sample set obtained by the sample acquisition device 32 to the user 20 through the sample set display interface 400, so that the user 20 can understand the distribution of the data set based on the sample set.
  • the left side of the sample set display interface 400 is a data source selection interface.
  • the data source selection interface includes a data source search box 402 and a data source display interface 404.
  • the user 20 can search for the target data source through the data source search box 402 and select the target data set that needs to be previewed from the target data source, or directly click the target data source from the data source display interface 404 and determine the target data set in the target data source to complete the selection of the target data set.
  • the right side of the sample set display interface 400 includes a data sampling method selection area 406 and a data preview area 408. The user 20 can select the data sampling method of the target data set through the data sampling method selection area 406 and preview the data of the sample set of the target data set through the data preview area 408.
  • user 20 selects data set 1 in data source 1 for data preview.
  • the background of the area where data set 1 is located can become dark, indicating that the data set is the target data set.
  • user 20 can obtain a drop-down menu 407 by clicking the drop-down mark in the data sampling method selection area 406 to select the data sampling method of data set 1.
  • the data sampling method can include random selection, selecting the first n rows, selecting by condition, and selecting representative samples. Selecting representative samples is the data sampling method provided in the embodiment of the present application.
  • User 20 can determine the sampling method for data set 1 by clicking the data sampling method in the drop-down menu 407.
  • the user 20 selects “select representative samples” as the data sampling method of the data set 1.
  • the user 20 can preview the sample set of the data set 1 through the data preview area 408.
  • the sample set of the data set 1 is obtained by sampling through the data sampling method of “select representative samples”.
  • After the data sampling system 30 samples the data set and presents the sample set to the user 20, the data set can be provided to the data processing system 40, so that the data processing system 40 can perform subsequent processing on the data set according to the operation selected by the user 20.
  • the user 20 can determine the processing operations of the data processing system 40 on the data set based on the sample set, such as deduplication operations and missing value filling operations.
  • the user 20 can select the processing operations that need to be performed on the data set based on the results of the data preview so that the data processing system 40 can better manage the data.
  • When the data sampling system 30 is used to select machine learning training data, the user 20 can understand the data distribution in the data set based on the sample set, so as to select representative data for data annotation.
  • the method includes:
  • the data sampling system 30 acquires a data set.
  • a data set refers to a collection of data.
  • the data set can store data in the form of a table, wherein each column of the table represents an attribute, and each row represents the data of a member.
  • the data sampling system 30 may obtain a data set from at least one data source, wherein the data source may include a search engine, a database, or other applications that can provide a data set, and the embodiment of the present application does not limit the type of the data source.
  • the data sampling system 30 determines the number of attribute columns and the data types of attribute values in the data set.
  • the columns in the data set may include index columns and attribute columns.
  • the number of index columns may be 1, and the role of the index column is equivalent to a directory.
  • the user can quickly find the corresponding content in the data set according to the index value of the index column.
  • the index value of each row of data in the data set is unique, thereby ensuring the uniqueness of each row of data in the data set.
  • the index column may include a number, a user ID, etc. When the index column is a number, the index value corresponding to the index column may be 1, 2, ..., n. When the index column is a user ID, the index value corresponding to the index column may be U01, U02, ..., Unn.
  • the data sampling system 30 can retrieve and identify the data in the data set.
  • Attribute columns are used to represent attribute values of different attributes.
  • attribute columns may include supplier, cloud service, city, price, etc.
  • the number of attribute columns may be 1 or greater than 1.
  • the number of attribute columns being 1 indicates that there is only one attribute column in the data set, and the number of attribute columns being greater than 1 indicates that there are multiple attribute columns in the data set.
  • the data types of attribute values corresponding to attribute columns may include discrete and continuous types.
  • the data type of an attribute value being discrete indicates that the attribute value is discrete data.
  • discrete attribute values may include attribute values corresponding to the attribute columns being supplier, cloud service, and city.
  • the data type of an attribute value being continuous indicates that the attribute value is continuous data and may take any value within the interval to which the attribute value belongs.
  • continuous attribute values may include attribute values corresponding to the attribute column being income.
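As a rough illustration of this determination step, the sketch below counts the attribute columns and labels each as discrete or continuous. The function name, the dict-per-row layout, and the rule that a column is continuous when all of its values are Python floats are assumptions made for the example; the application does not specify how the data type is detected.

```python
def describe_attribute_columns(rows, index_column):
    """Return (number of attribute columns, {column name: data type}).

    Hypothetical sketch: rows is a list of dicts sharing the same keys,
    and index_column names the index column, which is not an attribute column.
    """
    columns = [name for name in rows[0] if name != index_column]
    types = {}
    for name in columns:
        # Assumed rule: float-valued columns are continuous, all others discrete.
        is_continuous = all(isinstance(row[name], float) for row in rows)
        types[name] = "continuous" if is_continuous else "discrete"
    return len(columns), types
```

For a table with an index column, a supplier column, and an income column, this would report two attribute columns, with supplier as discrete and income as continuous.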
  • S506: The data sampling system 30 samples the data set according to the number of attribute columns in the data set and the data types of the attribute values to obtain a sample set.
  • the amount of data included in the data set is large.
  • the data set is a data set in a big data application scenario, and the amount of data in the data set can reach 100 million.
  • if the data sampling system 30 directly loads the data set into the memory and samples the data set to obtain a sample set, sampling takes a long time and has a high computational cost, and may cause a memory overflow. Therefore, an embodiment of the present application provides a two-stage data sampling method.
  • the data sampling system 30 can perform first-stage sampling and second-stage sampling on the data set. First, the data sampling system 30 can perform first-stage sampling on the data set to obtain an initial sample set, and then the data sampling system 30 can perform second-stage sampling on the initial sample set to obtain a sample set.
  • the data sampling system 30 may use random sampling technology to obtain an initial sample set from the data set.
  • the sample size in the initial sample set may be larger than the sample size in the sample set.
  • the maximum sample size in the sample set is 1000.
  • the sample size in the initial sample set may be 100,000.
  • Random sampling refers to sampling a data set according to a given ratio or sample size, so as to obtain an initial sample set from the data set with equal probability.
  • the data sampling system 30 can pre-set the sampling ratio, and randomly sample the data set according to the sampling ratio to obtain the initial sample set.
  • the sampling ratio can be any real number between 0 and 1.
  • the data sampling system 30 can set the sampling ratio to 0.03. When the amount of data in the data set is 100 million, the sample size in the initial sample set obtained by random sampling is 3 million.
  • the data sampling system 30 can pre-set the sampling number, determine the sampling ratio according to the sampling number, and then randomly sample the data set based on the sampling ratio to obtain the initial sample set.
  • the data sampling system 30 can set the sampling number to 3 million.
  • the data sampling system 30 determines that the sampling ratio is 0.03 based on the ratio of the sampling number to the amount of data in the data set, so as to perform random sampling according to the sampling ratio to obtain an initial sample set with a sample size of 3 million.
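  • The two ways of configuring first-stage random sampling described above (by sampling ratio or by sampling number) can be sketched as follows. The function name `first_stage_sample`, the `seed` parameter, and the choice of fixed-size sampling without replacement are assumptions made for illustration; the text only requires that rows be drawn with equal probability.

```python
import random


def first_stage_sample(rows, ratio=None, count=None, seed=0):
    """Randomly pick rows without replacement, each row equally likely.

    Exactly one of ratio (0 < ratio <= 1) or count must be given.
    """
    n = len(rows)
    if (ratio is None) == (count is None):
        raise ValueError("give exactly one of ratio or count")
    if count is not None:
        ratio = count / n          # derive the ratio from the sampling number
    k = min(n, round(ratio * n))   # size of the initial sample set
    return random.Random(seed).sample(rows, k)
```

  • With 100 million rows and a ratio of 0.03, `k` evaluates to 3 million, matching the example in the text. The sketch holds all rows in a list for brevity; a system concerned with memory overflow would presumably sample while streaming the data instead.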
  • through the first-stage sampling, the data size of the original data set can be effectively reduced, so that devices with limited computing resources can also perform subsequent second-stage sampling on the data set, thereby realizing the data preview function in the big data scenario.
  • the embodiment of the present application uses random sampling for the first stage sampling as an example for explanation.
  • the data sampling system 30 may use different sampling methods for the first stage sampling, and the embodiment of the present application does not limit this.
  • the first stage sampling provided in the embodiment of the present application is an optional step.
  • the data sampling system 30 may not perform the first stage sampling and directly perform the second stage sampling on the data set. The embodiment of the present application does not limit this.
  • the data sampling system 30 may perform the second stage sampling on the initial sample set to obtain the sample set. Specifically, the data sampling system 30 may obtain the sample set by sampling from the initial sample set according to the number of attribute columns and the data type of the attribute value.
  • the following describes the second-stage sampling process in three cases based on the number of attribute columns and the data type of the attribute values.
  • Case 1 Discrete data of a single attribute column: When the number of attribute columns is 1 and the data type of the attribute value is discrete, the data sampling system 30 can obtain a sample set by sampling according to the proportions of multiple attribute values corresponding to the attribute column in the initial sample set.
  • the initial sample set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the initial sample set.
  • the data sampling system 30 can obtain the first occurrence number, that is, the number of times the i-th attribute value among the multiple attribute values corresponding to the attribute column appears in the initial sample set, and then determine the number of occurrences of the i-th attribute value in the sample set based on the product of the first occurrence number and the ratio of the size of the sample set to the size of the initial sample set.
  • when the product is an integer, the data sampling system 30 can determine the number of times the i-th attribute value appears in the sample set to be that integer.
  • when the product is not an integer, the data sampling system 30 can round the product up or down and determine the resulting integer as the number of times the i-th attribute value appears in the sample set, choosing between rounding up and rounding down according to the distance difference between the sample set and the initial sample set after the product is rounded up and after it is rounded down.
  • the first occurrence number of the i-th attribute value in the initial sample set is recorded as $n_i$
  • the number of occurrences of the i-th attribute value in the sample set is recorded as $m_i$.
  • the product of the first occurrence number and the ratio of the size of the sample set to the size of the initial sample set can be expressed as $m n_i / n$.
  • $m/n$ represents the ratio of the size of the sample set to the size of the initial sample set.
  • the data sampling system 30 can determine the distance difference $\delta_i$ between the sample set and the initial sample set after the product is rounded up and after it is rounded down, where $\mathrm{ceil}(\cdot)$ represents the upward rounding function and $\mathrm{floor}(\cdot)$ represents the downward rounding function, so that the two candidate values are $m_i = \mathrm{ceil}(m n_i / n)$ and $m_i = \mathrm{floor}(m n_i / n)$.
  • the data sampling system 30 can sort the $\delta_i$ to determine the value of each $m_i$.
  • the $m_i$ corresponding to the first $\delta_i$ in the sorted order are rounded up, and the $m_i$ corresponding to the remaining $\delta_i$ are rounded down, so that the obtained sample set can better represent the data distribution of the initial sample set.
  • data sampling is performed according to the proportion of attribute values in the initial sample set, so that the sample set can represent the distribution of multiple attribute values in the initial sample set.
  • the distance of the sample set after rounding up and rounding down is evaluated by drawing on the idea of KL divergence (Kullback-Leibler divergence), so as to determine the number of times the attribute value appears in the sample set, so that the difference between the data distribution in the sample set and the data distribution in the initial sample set can be reduced.
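  • The proportional allocation of case 1 can be sketched as follows. The exact expression for the distance difference is not available in this text, so the sketch substitutes the largest-remainder rule: values are ranked by the remainder of $m n_i / n$, and as many values are rounded up as are needed for the counts to sum to m. That ranking key, the function names, and the choice to keep the first rows seen for each value are assumptions, not the method of the application.

```python
from collections import Counter


def allocate_counts(values, m):
    """Return {attribute value: m_i} with sum(m_i) == m."""
    n = len(values)
    counts = Counter(values)                       # n_i for each distinct value
    # Round m * n_i / n down first, using integer arithmetic.
    alloc = {v: (m * c) // n for v, c in counts.items()}
    leftover = m - sum(alloc.values())             # values that must round up
    # Assumed ranking key: remainder of m * n_i / n, largest first.
    ranked = sorted(counts, key=lambda v: (m * counts[v]) % n, reverse=True)
    for v in ranked[:leftover]:
        alloc[v] += 1
    return alloc


def sample_discrete(rows, column, m):
    """Keep the first m_i rows seen for each attribute value."""
    remaining = allocate_counts([row[column] for row in rows], m)
    sample = []
    for row in rows:
        if remaining[row[column]] > 0:
            remaining[row[column]] -= 1
            sample.append(row)
    return sample
```

  • For a hypothetical column with 5 rows of "Beijing" and 4 rows of "Hangzhou" and m = 4, the products are 20/9 and 16/9; both are rounded down to 2 and 1, and the single remaining slot goes to "Hangzhou", which has the larger remainder.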
  • the initial sample set includes two columns. The first column is the index column "User ID", with index values "U01" to "U09", and the second column is the attribute column "Location", whose attribute values include "Beijing" and "Hangzhou". It can be seen that the number of attribute columns in Table 1 is 1, and the data type of the attribute value is discrete.
  • Data sampling is performed for the data in Table 1.
  • Case 2 Continuous data of a single attribute column: When the number of attribute columns is 1 and the data type of the attribute value is continuous, the data sampling system 30 can obtain a sample set by sampling at a target interval.
  • the initial sample set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the initial sample set.
  • the data sampling system 30 can sort the multiple attribute values corresponding to the attribute column, and select data from the sorted sequence according to the target interval to obtain the sample set.
  • the data sampling system 30 arranges the multiple attribute values corresponding to the attribute column in ascending order and records the sorted sequence as $\{x_1, x_2, x_3, \ldots, x_n\}$; the sample set then consists of the elements selected from this sequence at the target interval.
  • a sample set is selected from the sequence after the attribute values are sorted at a target interval, so that the sample set can reflect the data distribution in the initial sample set, ensuring that the data in the sample set is representative.
  • the initial sample set includes two columns. The first column is the index column "User ID", with index values "U01" to "U09", and the second column is the attribute column "Income", whose attribute values include {200, 500, 100, 800, 600, 700, 300, 900, 400}. It can be seen that the number of attribute columns in Table 2 is 1, and the data type of the attribute value is continuous.
  • Data sampling is performed for the data in Table 2.
  • the sequence after the attribute values in Table 2 are sorted in ascending order is {100, 200, 300, 400, 500, 600, 700, 800, 900}.
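  • Interval sampling over the sorted sequence can be sketched as follows. The expression for the sample set is not reproduced in this text, so the interval used here (n/m, starting from the first element) and the function name `interval_sample` are assumptions; other starting offsets would be equally consistent with the description.

```python
def interval_sample(values, m):
    """Pick m values at a regular interval from the sorted sequence."""
    ordered = sorted(values)        # ascending order, as in the description
    n = len(ordered)
    # Assumed target interval: n / m, starting at the first element.
    return [ordered[(k * n) // m] for k in range(m)]


income = [200, 500, 100, 800, 600, 700, 300, 900, 400]
print(interval_sample(income, 3))   # prints [100, 400, 700]
```

  • With the nine income values of Table 2 and m = 3, the selected positions are 0, 3, and 6 of the sorted sequence, so the picks are spread evenly across the value range.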
  • Case 3 Data with multiple attribute columns: When the number of attribute columns is greater than 1, the data sampling system 30 may obtain a sample set based on a greedy algorithm.
  • the initial sample set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the initial sample set.
  • the data sampling system 30 can select the first column sample set of each attribute column according to the data type of the attribute value corresponding to each attribute column in the multiple attribute columns, and then select m rows of data from the initial sample set to obtain initial row sample data.
  • the attribute values corresponding to each attribute column in the initial row sample data form the second column sample set of that attribute column.
  • for target row data other than the m rows of data, the data sampling system 30 determines the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data, and determines, according to the distance, whether to perform the replacement operation on the initial row sample data, so as to obtain the sample set.
  • when the data type of the attribute value corresponding to an attribute column is discrete, the data sampling system 30 can use the method described in the above case 1 to select the first column sample set of the attribute column.
  • when the data type of the attribute value corresponding to an attribute column is continuous, the data sampling system 30 can use the method described in the above case 2 to select the first column sample set of the attribute column. Further, the data sampling system 30 can select the first m rows of data from the initial sample set as the initial row sample data.
  • when, after the target row replaces a row of data in the initial row sample data, the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column is greater than the sum of the distances between the first column sample set and the second column sample set of each attribute column before replacement, the data sampling system 30 may refuse to perform the operation of replacing a row of data in the initial row sample data with the target row.
  • a larger sum of distances after replacement indicates that the sample set after the replacement operation cannot better reflect the data distribution in the initial sample set, so the replacement operation is not performed.
  • a corresponding method can be used to determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data based on the data type of the attribute value of the i-th column.
  • when the attribute value corresponding to the i-th column is discrete, the data sampling system 30 can determine the difference in the number of occurrences of each attribute value of the i-th column between the first column sample set of the i-th column and the second column sample set after replacement, and then determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data based on the sum of the absolute values of these differences.
  • when the attribute value corresponding to the i-th column is continuous, the data sampling system 30 can sort the first column sample set of the i-th column and the replaced second column sample set in the same way, determine the difference between each element in the sorted first column sample set and the corresponding element in the sorted second column sample set, and then determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data based on the sum of the absolute values of these differences.
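  • The two per-column distances described above can be sketched as follows. The function names are illustrative, and the sketch uses the plain sum of absolute differences; whether the application additionally normalizes these sums is not stated in this text.

```python
from collections import Counter


def discrete_distance(first, second):
    """Sum of absolute differences in occurrence counts per attribute value."""
    a, b = Counter(first), Counter(second)
    return sum(abs(a[v] - b[v]) for v in set(a) | set(b))


def continuous_distance(first, second):
    """Sum of absolute differences between equally ranked sorted elements."""
    return sum(abs(x - y) for x, y in zip(sorted(first), sorted(second)))
```

  • Both functions expect the two column sample sets to have the same number of elements, which holds here because each contains m values.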
  • the data sampling system 30 can select the first column sample set of each attribute column by using the method in the above case 1 or case 2 according to the data type of the attribute value corresponding to each attribute column, and record the first column sample set of the i-th column as $S_i^{*}$. Next, the data sampling system 30 uses the first m rows of data in the initial sample set as initial row sample data, thereby obtaining a second column sample set corresponding to the initial row sample data, and records the second column sample set of the i-th column as $S_i$.
  • the data sampling system 30 can traverse the target rows to determine the distance between the first column sample set and the second column sample set when the target row replaces a row of data in the initial row sample data.
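  • the attribute value corresponding to the i-th column has K distinct values
  • the number of occurrences of the j-th value in the first column sample set is recorded as $m_j^{*}$, and the number of times the j-th value appears in the second column sample set $S_i$ is recorded as $m_j$.
  • when the attribute value corresponding to the i-th column is discrete, the distance can be: $\sum_{j=1}^{K} \lvert m_j^{*} - m_j \rvert$
  • when the attribute value corresponding to the i-th column is continuous, the distance can be: $\sum_{j=1}^{m} \lvert x_j^{*} - x_j \rvert$, where $x_j^{*}$ and $x_j$ are the j-th elements of the sorted first column sample set and the sorted second column sample set, respectively.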
  • KS statistic Kolmogorov-Smirnov Statistic
  • the data sampling system 30 can determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row under investigation replaces multiple rows (for example, each row) of data in the initial row sample.
  • the row with the largest reduction in distance is selected for replacement.
  • For example, the initial row sample includes two rows. If the target row replaces the first row in the initial row sample, the sum of the distances can be reduced by 0.5; if the target row replaces the second row in the initial row sample, the sum of the distances can be reduced by 0.7. The target row is therefore selected to replace the second row in the initial row sample.
  • the replacement effect is improved by selecting the row with the largest reduction in distance after replacement for replacement.
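  • The greedy replacement with best-row selection can be sketched as follows. The function names, the list-of-dicts row layout, and the `targets` and `kinds` dictionaries (the first column sample set and the data type of each attribute column) are assumptions introduced for illustration.

```python
from collections import Counter


def column_distance(first, second, kind):
    """Distance between a first and a second column sample set."""
    if kind == "discrete":
        a, b = Counter(first), Counter(second)
        return sum(abs(a[v] - b[v]) for v in set(a) | set(b))
    return sum(abs(x - y) for x, y in zip(sorted(first), sorted(second)))


def total_distance(sample, targets, kinds):
    """Sum of the per-column distances over all attribute columns."""
    return sum(
        column_distance(targets[c], [row[c] for row in sample], kinds[c])
        for c in targets
    )


def greedy_sample(rows, m, targets, kinds):
    sample = list(rows[:m])                 # initial row sample data
    current = total_distance(sample, targets, kinds)
    for target_row in rows[m:]:             # target rows outside the first m
        best_pos, best = None, current
        for pos in range(m):                # try replacing each sampled row
            candidate = sample[:pos] + [target_row] + sample[pos + 1:]
            d = total_distance(candidate, targets, kinds)
            if d < best:                    # keep the largest reduction
                best_pos, best = pos, d
        if best_pos is not None:            # otherwise the replacement is rejected
            sample[best_pos] = target_row
            current = best
    return sample
```

  • The sketch sums raw per-column distances, so a column with a large numeric range (such as income) dominates a column of occurrence counts; whether and how the distances are scaled before being summed is not stated in this text. It also recomputes every column for every candidate, which is what the pruning and incremental calculation described later are meant to avoid.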
  • the data sampling system 30 may also determine the distance between the first column sample set and the second column sample set of at least one attribute column when replacing a row of data in the initial row sample in a certain order (for example, from top to bottom), and perform a replacement operation when the sum of the distances between the first column sample set and the second column sample set of at least one attribute column is less than the sum of the distances between the first column sample set and the second column sample set of each attribute column before replacement.
  • For example, the initial row sample includes two rows. When the data sampling system 30 determines that the sum of the distances can be reduced by 0.5 by replacing the first row in the initial row sample with the target row, the target row directly replaces the first row in the initial row sample, and the reduction in the sum of the distances when replacing the second row in the initial row sample with the target row is no longer calculated.
  • the replacement speed is improved by selecting the first row whose replacement reduces the distance.
  • the row that has been replaced in the initial row sample may not be considered.
  • For example, the initial row sample includes two rows, and there are two target rows, which are recorded as the first target row and the second target row. Suppose the data sampling system 30 has already performed the operation of replacing the first row in the initial row sample with the first target row. When the second target row is examined, the sum of the distances after the second target row replaces the first target row may not be considered; only the sum of the distances after the second target row replaces the second row in the initial row sample is determined, so as to decide whether to perform the replacement operation on the second row in the initial row sample, thereby reducing the amount of calculation.
  • the embodiment of the present application may provide a pruning strategy to improve replacement efficiency.
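  • Specifically, once it can already be determined partway through the calculation that the sum of the distances after replacement will exceed the sum of the distances before replacement, the data sampling system 30 may stop the calculation and reject the operation of replacing a row of data in the initial row sample data with the target row.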
  • the embodiments of the present application may provide incremental calculation to reduce the amount of calculation.
  • the data sampling system 30 may recalculate the distance between the first column sample set and the second column sample set only for the attribute columns whose attribute values have changed as a result of the replacement; for the attribute columns whose attribute values have not changed, the data sampling system 30 may reuse the distance between the first column sample set and the second column sample set from before the replacement.
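  • Pruning and incremental calculation can be combined in a single candidate check, sketched below. The function `try_replace`, the `cached` dictionary of per-column distances before replacement, and the stopping rule (stop as soon as the running sum reaches the old total) are assumptions; the stopping rule relies only on the fact that each per-column distance is non-negative.

```python
from collections import Counter


def column_distance(first, second, kind):
    """Distance between a first and a second column sample set."""
    if kind == "discrete":
        a, b = Counter(first), Counter(second)
        return sum(abs(a[v] - b[v]) for v in set(a) | set(b))
    return sum(abs(x - y) for x, y in zip(sorted(first), sorted(second)))


def try_replace(sample, pos, new_row, targets, kinds, cached):
    """Return updated per-column distances if the swap helps, else None.

    cached: {attribute column: distance before replacement}.
    """
    old_total = sum(cached.values())
    old_row = sample[pos]
    changed = [c for c in targets if old_row[c] != new_row[c]]
    # Incremental calculation: unchanged columns reuse their cached distance.
    updated = dict(cached)
    running = sum(cached[c] for c in targets if c not in changed)
    for c in changed:
        column = [r[c] for r in sample]
        column[pos] = new_row[c]
        updated[c] = column_distance(targets[c], column, kinds[c])
        running += updated[c]
        # Pruning: distances are non-negative, so the sum can only grow.
        if running >= old_total:
            return None
    return updated if running < old_total else None
```

  • When the function returns a dictionary, the caller can perform the replacement and keep the returned distances as the new cache, so the next candidate again only recomputes the columns that change.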
  • the initial sample set includes three columns.
  • the first column is the index column "User ID", and the index values are "U01" to "U09".
  • the second column is the attribute column "Location", and the attribute values include "Beijing" and "Hangzhou".
  • the third column is the attribute column "Income", and the attribute values include {200, 500, 100, 800, 600, 700, 300, 900, 400}. It can be seen that the number of attribute columns in Table 3 is 2, and the data types of the attribute values include discrete and continuous types.
  • the sum of the distances between the first column sample set and the second column sample set before and after the replacement is determined.
  • the data sampling system 30 can perform data sampling using a corresponding sampling method according to the number of attribute columns and the data types of attribute values, thereby obtaining a sample set.
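  • The choice of sampling method from the number of attribute columns and the data types can be sketched as a small dispatcher. The function name `choose_strategy` and the strategy labels are hypothetical; they simply name the three cases described above.

```python
def choose_strategy(column_types):
    """Map {attribute column: 'discrete' | 'continuous'} to a sampling case."""
    if not column_types:
        raise ValueError("the data set has no attribute columns")
    if len(column_types) > 1:
        return "greedy"                    # case 3: multiple attribute columns
    (kind,) = column_types.values()        # exactly one attribute column
    # case 1: proportional allocation; case 2: interval sampling
    return "proportional" if kind == "discrete" else "interval"
```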
  • the data sampling system 30 may also present the sample set to the user to achieve data preview, or perform data analysis based on the sample set through an artificial intelligence (AI) algorithm.
  • the data sampling system 30 may send the sample set to the data processing system 40, so that data processing operations such as data analysis may be performed in the data processing system 40.
  • This method performs data sampling based on the number of attribute columns and the data type of attribute values in a large amount of original data, thereby obtaining sample data that is close to the global data distribution. This can improve the representativeness of the sample data and make the sample data more suitable for data preview scenarios, making it easier for users to perform subsequent data processing based on the sample data.
  • the embodiment of the present application further provides a data sampling system 30 as described above.
  • the data sampling system 30 is introduced below in conjunction with the accompanying drawings.
  • the system 30 includes:
  • An acquisition module 302 is used to acquire a data set
  • a determination module 304 is used to determine the number of attribute columns and the data types of attribute values in the data set
  • the sampling module 306 is used to obtain a sample set by sampling from the data set according to the number of attribute columns and the data types of attribute values.
  • the acquisition module 302 , the determination module 304 and the sampling module 306 may be modules in the sample acquisition device 32 .
  • The above module division method is only one possible implementation provided by the embodiment of the present application. In other possible implementations, different module division methods can be used as needed, and the embodiment of the present application does not limit this.
  • the acquisition module 302, determination module 304 and sampling module 306 can be implemented by hardware modules or software modules.
  • the acquisition module 302, determination module 304 and sampling module 306 can be implemented by a computing device or a computing engine on a computing device.
  • the acquisition module 302 is taken as an example for description.
  • the acquisition module 302 may be an application or application module running on a computing device or a computing device cluster, such as a computing engine.
  • the acquisition module 302 may include at least one computing device, such as a server, etc.
  • the acquisition module 302 may also be a device implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • system 30 further includes:
  • An interactive module for presenting the sample set to the user;
  • the data analysis module is used to perform data analysis based on the sample set through artificial intelligence (AI) algorithms.
  • sampling module 306 is specifically configured to:
  • the sample set is obtained by sampling according to the proportion of multiple attribute values corresponding to the attribute columns in the data set that appear in the data set.
  • the data set includes n rows of data
  • the sample set includes m rows of data
  • the sample set is a subset of the data set
  • the sampling module 306 is specifically used to:
  • the number of times the i-th attribute value appears in the sample set is determined based on the product of the first occurrence number, that is, the number of times the i-th attribute value appears in the data set, and the ratio of the size of the sample set to the size of the data set.
  • sampling module 306 is specifically configured to:
  • the number of times the i-th attribute value appears in the sample set is determined to be the integer
  • the integer after rounding up or down of the product is determined as the number of times the i-th attribute value appears in the sample set according to the distance difference between the sample set and the data set after the product is rounded up or down.
  • sampling module 306 is specifically configured to:
  • data is selected according to the target interval to obtain a sample set.
  • sampling module 306 is specifically configured to:
  • the first column sample set of each attribute column is selected according to the data type of the attribute value corresponding to each attribute column in the multiple attribute columns;
  • sampling module 306 is specifically configured to:
  • sampling module 306 is specifically configured to:
  • At least one attribute column includes the i-th column, and when the attribute value corresponding to the i-th column is discrete, determining the difference in the number of occurrences of various attribute values of the i-th column in the first column sample set of the i-th column and the second column sample set after replacement;
  • the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data is determined.
  • sampling module 306 is specifically configured to:
  • At least one attribute column includes the i-th column, and when the attribute value corresponding to the i-th column is continuous, the first column sample set and the replaced second column sample set of the i-th column are sorted in the same way;
  • the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of data in the initial row sample data is determined.
  • the present application also provides a computing device 800.
  • the computing device 800 includes: a bus 802, a processor 804, a memory 806, and a communication interface 808.
  • the processor 804, the memory 806, and the communication interface 808 communicate with each other through the bus 802.
  • the computing device 800 can be a server or a terminal device. It should be understood that the present application does not limit the number of processors and memories in the computing device 800.
  • the bus 802 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • the bus may be divided into an address bus, a data bus, a control bus, etc.
  • For ease of representation, the bus is represented by only one line in FIG. 8, but this does not mean that there is only one bus or one type of bus.
  • the bus 802 may include a path for transmitting information between various components of the computing device 800 (e.g., the memory 806, the processor 804, and the communication interface 808).
  • Processor 804 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
  • the memory 806 may include a volatile memory (volatile memory), such as a random access memory (RAM).
  • the memory 806 may also include a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid state drive (SSD).
  • the memory 806 stores executable program code, and the processor 804 executes the executable program code to implement the aforementioned data sampling method. Specifically, the memory 806 stores instructions for the data sampling system 30 to execute the data sampling method.
  • the communication interface 808 uses a transceiver module such as, but not limited to, a network interface card or a transceiver to implement communication between the computing device 800 and other devices or communication networks.
  • the embodiment of the present application also provides a computing device cluster.
  • the computing device cluster includes at least one computing device.
  • the computing device can be a server, such as a central server, an edge server, or a local server in a local data center.
  • the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
  • the computing device cluster includes at least one computing device 800.
  • the memory 806 in one or more computing devices 800 in the computing device cluster may store the same instructions of the data sampling system 30 for executing the data sampling method.
  • one or more computing devices 800 in the computing device cluster may also be used to execute some instructions of the data sampling system 30 for executing the data sampling method.
  • a combination of one or more computing devices 800 may jointly execute instructions of the data sampling system 30 for executing the data sampling method.
  • the memory 806 in different computing devices 800 in the computing device cluster may store different instructions for executing partial functions of the data sampling system 30 .
  • FIG. 10 shows a possible implementation.
  • two computing devices 800A and 800B are connected via a communication interface 808.
  • the memory in the computing device 800A stores instructions for executing the functions of the acquisition module 302 and the determination module 304.
  • the memory in the computing device 800B stores instructions for executing the functions of the sampling module 306.
  • the memories 806 of the computing devices 800A and 800B jointly store instructions for the data sampling system 30 to execute the data sampling method.
  • It should be understood that the connection mode between the computing devices in the cluster shown in FIG. 10 may be based on the consideration that the data sampling method provided by the present application needs to determine the relevant information of the attribute columns and attribute values in the data set before performing data sampling. Therefore, the functions implemented by the acquisition module 302 and the determination module 304 are performed by the computing device 800A, and the functions implemented by the sampling module 306 are performed by the computing device 800B.
  • It should be understood that the functions of the computing device 800A shown in FIG. 10 may also be completed by multiple computing devices 800.
  • Similarly, the functions of the computing device 800B may also be completed by multiple computing devices 800.
  • one or more computing devices in the computing device cluster may be connected via a network.
  • the network may be a wide area network or a local area network, etc.
  • FIG. 11 shows a possible implementation. As shown in FIG. 11 , two computing devices 800C and 800D are connected via a network. Specifically, the network is connected via a communication interface in each computing device.
  • the memory 806 in the computing device 800C stores instructions for executing the functions of the acquisition module 302 and the determination module 304. At the same time, the memory 806 in the computing device 800D stores instructions for executing the functions of the sampling module 306.
  • It should be understood that the connection mode between the computing devices in the cluster shown in FIG. 11 may be based on the consideration that the data sampling method provided in the present application needs to determine the relevant information of the attribute columns and attribute values in the data set before performing data sampling. Therefore, the functions implemented by the acquisition module 302 and the determination module 304 are performed by the computing device 800C, and the functions implemented by the sampling module 306 are performed by the computing device 800D. It should be understood that the functions of the computing device 800C shown in FIG. 11 can also be completed by multiple computing devices 800. Similarly, the functions of the computing device 800D can also be completed by multiple computing devices 800.
  • the embodiment of the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium can be any available medium that can be stored by a computing device or a data storage device such as a data center that contains one or more available media.
  • the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state hard disk).
  • the computer-readable storage medium includes instructions that instruct the computing device to execute the above-mentioned data sampling method applied to the data sampling system 30.
  • the embodiment of the present application also provides a computer program product including instructions.
  • the computer program product may be software or a program product including instructions that can be run on a computing device or stored in any available medium.
  • when the computer program product runs on at least one computing device, the at least one computing device executes the above data sampling method.

Abstract

Provided in the present application is a data sampling method, comprising: acquiring a data set; determining the number of attribute columns and the data types of attribute values in the data set; and sampling the data set, according to the number of attribute columns and the data types of the attribute values, to obtain a sample set. The method can acquire sample data that is close to the global data distribution, which improves the representativeness of the sample data and makes it better suited to data preview scenarios, so that a user can more easily perform subsequent data processing based on the sample data.

Description

Data sampling method and related device
This application claims priority to Chinese Patent Application No. 202211372127.7, filed with the China National Intellectual Property Administration on November 3, 2022 and entitled "Data sampling method and related device", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing technology, and in particular to a data sampling method, a system, a computing device cluster, a computer-readable storage medium, and a computer program product.
Background
With the emergence of the Internet, and especially the mobile Internet, data volumes have grown explosively. Mining and analyzing massive amounts of data has become a research hotspot. Before data can be mined, analyzed, or otherwise processed, data preparation is usually required.
Data preparation refers to the process of preprocessing raw data into data suitable for subsequent processing (such as mining and analysis). Data preview is one of the important features of data preparation. It gives users a tool for quickly understanding how the data is distributed. By previewing the data, users can choose the next data processing or analysis operation. For example, a user who finds many duplicate entries while previewing the data can submit a task to deduplicate the data set.
A typical data preview process randomly samples the data set containing the raw data to obtain sample data, or selects the first n rows as sample data, and then displays the sample data. However, sample data selected in this way is not sufficiently representative. Especially in small-sample scenarios, sample data obtained by random sampling or by selecting the first n rows can hardly characterize the overall data distribution and provides little help for subsequent data processing.
Summary
The present application provides a data sampling method that can obtain sample data close to the global data distribution, thereby greatly improving the representativeness of the sample data. The present application also provides a data sampling system, a computing device cluster, a computer-readable storage medium, and a computer program product corresponding to the method.
In a first aspect, the present application provides a data sampling method. The method may be performed by a data sampling system. The data sampling system may be a software system deployed in a computing device cluster. The computing device cluster executes the program code of the data sampling system, thereby performing the data sampling method of the present application. In some possible implementations, the data sampling system may also be a hardware system, such as a computing device cluster with a data sampling function.
Specifically, the data sampling system can obtain a data set and determine the number of attribute columns and the data types of attribute values in the data set. The data sampling system can then sample from the data set, according to the number of attribute columns and the data types of the attribute values, to obtain a sample set.
The method samples data based on the number of attribute columns and the data types of attribute values in a large amount of original data, thereby obtaining sample data that is close to the global data distribution. This improves the representativeness of the sample data and makes the sample data better suited to data preview scenarios, so that users can more easily perform subsequent data processing based on the sample data.
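The branching on column count and data type can be pictured with a minimal Python sketch. The function name `choose_strategy`, the type labels `"discrete"` and `"continuous"`, and the returned branch names are assumptions made for illustration; the application does not prescribe any of them.

```python
from typing import Sequence

VALID_TYPES = {"discrete", "continuous"}  # hypothetical labels for attribute value types


def choose_strategy(column_types: Sequence[str]) -> str:
    """Return the (hypothetical) name of the sampling branch for a data set.

    column_types holds one type label per attribute column.
    """
    if not column_types:
        raise ValueError("the data set needs at least one attribute column")
    if any(t not in VALID_TYPES for t in column_types):
        raise ValueError("unknown attribute value type")
    if len(column_types) == 1:
        # Single column: the branch depends only on the value type.
        return "proportional" if column_types[0] == "discrete" else "interval"
    # More than one column: row-level sampling over all columns.
    return "greedy_replacement"
```

The three branch names correspond to the single discrete column, single continuous column, and multi-column cases that the summary goes on to describe.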
In some possible implementations, the data sampling system may also present the sample set to a user, or perform data analysis on the sample set through an artificial intelligence (AI) algorithm.
The sample set obtained by this method can be used in multiple scenarios such as data preview and data processing. Users can learn the data distribution of the data set through data preview, or use the sample set to decide on subsequent processing operations for the data set.
In some possible implementations, when the number of attribute columns is 1 and the data type of the attribute values is discrete, the data sampling system may sample according to the proportion in which each attribute value of the attribute column appears in the data set, to obtain the sample set.
For discrete data with a single attribute column, the method samples according to the proportions of the attribute values, which improves the representativeness of the sample data.
In some possible implementations, the data set includes n rows of data, the sample set includes m rows of data, and the sample set is a subset of the data set. The data sampling system may obtain a first count, that is, the number of times the i-th attribute value among the multiple attribute values appears in the data set, and then determine the number of times the i-th attribute value appears in the sample set based on the product of the first count and the size ratio of the sample set to the data set (m/n).
For discrete data with a single attribute column, the sample set obtained by this method represents the distribution of the attribute values in the data set, which helps users understand the global data distribution of the data set.
In some possible implementations, when the product is an integer, the data sampling system may take that integer as the number of times the i-th attribute value appears in the sample set. When the product is not an integer, the data sampling system may compare the distance between the sample set and the data set when the product is rounded up with the distance when the product is rounded down, and take the rounded-up or rounded-down integer as the number of times the i-th attribute value appears in the sample set.
For discrete data with a single attribute column, when the number of times an attribute value should appear in the sample set is not an integer, the method borrows the idea of KL divergence to evaluate the distance of the sample set after rounding up and after rounding down, and uses the result to determine the number of occurrences of the attribute value in the sample set. This reduces the difference between the data distribution in the sample set and the data distribution in the initial sample set.
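The count allocation for a single discrete column can be sketched as follows. This is only one possible reading: the summary does not spell out how the rounded-up and rounded-down distances are compared, so the sketch assumes a KL-style term p·log(p/s) per attribute value and hands the leftover rows to the values whose term shrinks most when rounded up. The name `allocate_counts` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def allocate_counts(values: Sequence[Hashable], m: int) -> Dict[Hashable, int]:
    """Decide how often each attribute value should appear in an m-row sample."""
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(values)")
    counts = Counter(values)
    # Start from the product (m / n) * count rounded down, in integer arithmetic.
    alloc = {v: (m * c) // n for v, c in counts.items()}
    remaining = m - sum(alloc.values())  # rows still to hand out

    def kl_term(v: Hashable, k: int) -> float:
        # Contribution of value v to KL(data || sample) if v appears k times.
        p = counts[v] / n
        return math.inf if k == 0 else p * math.log(p / (k / m))

    def gain(v: Hashable) -> float:
        # How much the divergence drops if v is rounded up instead of down.
        return kl_term(v, alloc[v]) - kl_term(v, alloc[v] + 1)

    # Only values with a non-integer product have a rounding choice.
    candidates = [v for v, c in counts.items() if (m * c) % n != 0]
    candidates.sort(key=gain, reverse=True)
    for v in candidates[:remaining]:
        alloc[v] += 1
    return alloc
```

For a column with six "a", three "b", and one "c" and m = 5, the products are 3, 1.5, and 0.5; the sketch rounds "c" up because leaving it out of the sample entirely makes the divergence term infinite.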
In some possible implementations, when the number of attribute columns is 1 and the data type of the attribute values is continuous, the data sampling system may sort the attribute values of the attribute column, and then select data from the sorted sequence at a target interval to obtain the sample set.
For continuous data with a single attribute column, the method selects the sample set from the sorted sequence of attribute values at the target interval, so that the sample set reflects the data distribution in the data set and the data in the sample set is representative.
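A minimal sketch of interval selection over a sorted column is shown below. The target interval is assumed here to be n/m, with one value taken from the middle of each block of the sorted sequence; the application does not fix these details, and the name `interval_sample` is hypothetical.

```python
from typing import List, Sequence


def interval_sample(values: Sequence[float], m: int) -> List[float]:
    """Pick m values at a regular interval from the sorted attribute values."""
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(values)")
    ordered = sorted(values)
    # Index of the midpoint of the j-th block of width n / m (integer arithmetic).
    return [ordered[((2 * j + 1) * n) // (2 * m)] for j in range(m)]
```

Because the picks are spread evenly over the sorted order, dense regions of the column contribute more samples than sparse ones, which is what lets the sample follow the shape of the distribution.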
In some possible implementations, when the number of attribute columns is greater than 1, the data sampling system may select a first column sample set for each attribute column according to the data type of the attribute values of that column, and then select m rows of data from the data set to obtain initial row sample data. The attribute values of each attribute column in the initial row sample data form a second column sample set for that column. For target row data outside the m rows of data, the data sampling system may determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of the initial row sample data, and determine, based on that distance, whether to perform the replacement on the initial row sample data, thereby obtaining the sample set.
For data with multiple attribute columns, the method samples data with a greedy algorithm. Replacing rows of the initial row sample data with target rows brings the data distribution of the resulting sample set closer to the data distribution of the data set, which improves the representativeness of the sample data.
In some possible implementations, when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column after replacement is greater than the sum of the distances between the first column sample set and the second column sample set of each attribute column before replacement, the data sampling system may refuse to replace a row of the initial row sample data with the target row.
For data with multiple attribute columns, the method does not perform a replacement when the sample set after the replacement would not better reflect the data distribution in the initial sample set, thereby ensuring that the sample data better reflects the data distribution in the data set.
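The greedy replacement loop can be sketched as below, under several stated assumptions: the initial row sample is simply the first m rows, each candidate row is tried in every slot, and a replacement is accepted only when it strictly lowers the summed column distance (ties are rejected). The per-column distance functions are passed in as callables; the names `greedy_row_sample` and `total_distance` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

ColumnDistance = Callable[[Sequence, Sequence], float]


def total_distance(sample_rows: Sequence[Tuple],
                   first_sets: Sequence[Sequence],
                   distances: Sequence[ColumnDistance]) -> float:
    """Sum, over all columns, the distance between the first column sample set
    and the second column sample set formed by the current row sample."""
    total = 0.0
    for j, (first_set, distance) in enumerate(zip(first_sets, distances)):
        second_set = [row[j] for row in sample_rows]
        total += distance(first_set, second_set)
    return total


def greedy_row_sample(rows: Sequence[Tuple],
                      first_sets: Sequence[Sequence],
                      distances: Sequence[ColumnDistance],
                      m: int) -> List[Tuple]:
    """Refine an initial m-row sample by greedy row replacement."""
    sample = list(rows[:m])  # initial row sample data (assumption: first m rows)
    best = total_distance(sample, first_sets, distances)
    for target in rows[m:]:
        best_slot, best_value = None, best
        for slot in range(m):
            old = sample[slot]
            sample[slot] = target  # tentatively replace one row
            value = total_distance(sample, first_sets, distances)
            sample[slot] = old     # undo the tentative replacement
            if value < best_value:
                best_slot, best_value = slot, value
        if best_slot is not None:  # otherwise the replacement is rejected
            sample[best_slot] = target
            best = best_value
    return sample
```

Recomputing the full distance for every slot costs O(m) distance evaluations per target row; an implementation that cares about speed would update the column distances incrementally, which this sketch leaves out for clarity.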
In some possible implementations, the at least one attribute column includes an i-th column. When the attribute values of the i-th column are discrete, the data sampling system may determine, for each attribute value of the i-th column, the difference between its number of occurrences in the first column sample set of the i-th column and its number of occurrences in the second column sample set after replacement. The system then determines, from the sum of the absolute values of these differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
For discrete data in multiple attribute columns, the method borrows the idea of the KS statistic and determines the distance from the differences in the occurrence counts of the attribute values, in order to decide whether to perform the replacement. The sample set obtained in this way is more representative.
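The discrete column distance described above amounts to a sum of absolute count differences. A minimal sketch, with the hypothetical name `discrete_column_distance`:

```python
from collections import Counter
from typing import Hashable, Sequence


def discrete_column_distance(first_set: Sequence[Hashable],
                             second_set: Sequence[Hashable]) -> int:
    """Sum of absolute differences in occurrence counts of each attribute value."""
    first, second = Counter(first_set), Counter(second_set)
    # A value missing from one side counts as zero occurrences there.
    return sum(abs(first[v] - second[v]) for v in first.keys() | second.keys())
```

For example, comparing `["a", "a", "b"]` with `["a", "b", "b"]` gives |2 - 1| + |1 - 2| = 2.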
In some possible implementations, the at least one attribute column includes an i-th column. When the attribute values of the i-th column are continuous, the data sampling system may sort the first column sample set of the i-th column and the second column sample set after replacement in the same way, and determine the difference between each element of the sorted first column sample set and the corresponding element of the sorted second column sample set. The system may then determine, from the sum of the absolute values of these differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
For continuous data in multiple attribute columns, the method borrows the idea of the KS statistic and determines the distance from the element-wise differences, in order to decide whether to perform the replacement. The sample set obtained in this way is more representative.
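The continuous column distance can be sketched as an element-wise comparison of the two sorted sets. The sketch assumes both sets hold the same number of values (m each) and uses ascending order as the shared sort; the name `continuous_column_distance` is hypothetical.

```python
from typing import Sequence


def continuous_column_distance(first_set: Sequence[float],
                               second_set: Sequence[float]) -> float:
    """Sum of absolute differences between correspondingly ranked elements."""
    if len(first_set) != len(second_set):
        raise ValueError("both column sample sets must have the same size")
    # Sort both sets the same way so that elements of equal rank are compared.
    return sum(abs(a - b) for a, b in zip(sorted(first_set), sorted(second_set)))
```

For example, `[1.0, 5.0, 3.0]` and `[2.0, 3.0, 7.0]` sort to `[1, 3, 5]` and `[2, 3, 7]`, giving 1 + 0 + 2 = 3.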
In some possible implementations, the data sampling system can sample the data set in two stages. The first stage may use random sampling, and the second stage may sample according to the number of attribute columns and the data types of the attribute values in the data set. This effectively reduces the scale of the original data set, so that devices with limited computing resources can also perform the second-stage sampling, which makes data preview possible in big data scenarios.
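A two-stage pipeline of this kind can be sketched as follows. The first stage uses Python's `random.sample`; the second stage is passed in as a callable because it depends on the column count and value types. The names `two_stage_sample` and `stage_one_size` and the `seed` parameter are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def two_stage_sample(rows: Sequence[T],
                     stage_one_size: int,
                     m: int,
                     second_stage: Callable[[List[T], int], List[T]],
                     seed: Optional[int] = None) -> List[T]:
    """Shrink the data set by random sampling, then run a distribution-aware sampler."""
    rng = random.Random(seed)
    if len(rows) > stage_one_size:
        reduced = rng.sample(list(rows), stage_one_size)  # stage 1: random sampling
    else:
        reduced = list(rows)                              # already small enough
    return second_stage(reduced, m)                       # stage 2: type-aware sampling
```

The design choice is that only the cheap first stage touches the full data set, so the cost of the second stage depends on `stage_one_size` rather than on the original data volume.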
In some possible implementations, the method can be used for data preparation, so that the user can decide on processing operations for the data set based on the sample set. Because the sample set is close to the global data distribution of the data set, the method can improve the efficiency of the user's data preparation.
In some possible implementations, the method can be used to select training data for machine learning. Because the user can learn the data distribution of the data set from the sample set, the method enables the user to select representative data for data annotation.
In a second aspect, the present application provides a data sampling system. The system comprises:
an acquisition module, configured to acquire a data set;
a determination module, configured to determine the number of attribute columns and the data types of attribute values in the data set;
a sampling module, configured to sample from the data set, according to the number of attribute columns and the data types of the attribute values, to obtain a sample set.
In some possible implementations, the system further includes:
an interaction module, configured to present the sample set to a user; or
a data analysis module, configured to perform data analysis on the sample set through an artificial intelligence (AI) algorithm.
In some possible implementations, the sampling module is specifically configured to:
when the number of attribute columns is 1 and the data type of the attribute values is discrete, sample according to the proportion in which each attribute value of the attribute column appears in the data set, to obtain the sample set.
In some possible implementations, the data set includes n rows of data, the sample set includes m rows of data, the sample set is a subset of the data set, and the sampling module is specifically configured to:
obtain a first count, that is, the number of times the i-th attribute value among the multiple attribute values appears in the data set;
determine the number of times the i-th attribute value appears in the sample set according to the product of the first count and the size ratio of the sample set to the data set (m/n).
In some possible implementations, the sampling module is specifically configured to:
when the product is an integer, take the integer as the number of times the i-th attribute value appears in the sample set;
when the product is not an integer, compare the distance between the sample set and the data set when the product is rounded up with the distance when the product is rounded down, and take the rounded-up or rounded-down integer as the number of times the i-th attribute value appears in the sample set.
In some possible implementations, the sampling module is specifically configured to:
when the number of attribute columns is 1 and the data type of the attribute values is continuous, sort the attribute values of the attribute column;
select data from the sorted sequence at a target interval to obtain the sample set.
In some possible implementations, the sampling module is specifically configured to:
when the number of attribute columns is greater than 1, select a first column sample set for each attribute column according to the data type of the attribute values of that column;
select m rows of data from the data set to obtain initial row sample data, wherein the attribute values of each attribute column in the initial row sample data form a second column sample set for that column;
for target row data outside the m rows of data, determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of the initial row sample data;
determine, according to the distance, whether to perform the replacement on the initial row sample data, to obtain the sample set.
In some possible implementations, the sampling module is specifically configured to:
when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column is greater than the sum of the distances between the first column sample set and the second column sample set of each attribute column before replacement, refuse to replace a row of the initial row sample data with the target row.
In some possible implementations, the sampling module is specifically configured to:
when the at least one attribute column includes an i-th column and the attribute values of the i-th column are discrete, determine, for each attribute value of the i-th column, the difference between its number of occurrences in the first column sample set of the i-th column and its number of occurrences in the second column sample set after replacement;
determine, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
In some possible implementations, the sampling module is specifically configured to:
when the at least one attribute column includes an i-th column and the attribute values of the i-th column are continuous, sort the first column sample set of the i-th column and the second column sample set after replacement in the same way;
determine the difference between each element of the sorted first column sample set and the corresponding element of the sorted second column sample set;
determine, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
In a third aspect, the present application provides a computing device cluster. The computing device cluster includes at least one computing device, and the at least one computing device includes at least one processor and at least one memory. The at least one processor and the at least one memory communicate with each other. The at least one processor is configured to execute instructions stored in the at least one memory, so that the computing device or the computing device cluster performs the data sampling method described in the first aspect or any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing instructions that instruct a computing device or a computing device cluster to perform the data sampling method described in the first aspect or any implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on a computing device or a computing device cluster, cause the computing device or the computing device cluster to perform the data sampling method described in the first aspect or any implementation of the first aspect.
The implementations provided in the above aspects may be further combined to provide more implementations.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of a method for obtaining sample data provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a method for obtaining sample data provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of the architecture of a data sampling system provided in an embodiment of the present application;
FIGS. 4A to 4C are schematic diagrams of a sample set display interface provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of a data sampling method provided in an embodiment of the present application;
FIG. 6 is a schematic flowchart of a two-stage data sampling method provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the structure of a data sampling system provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of the structure of a computing device provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of the structure of a computing device cluster provided in an embodiment of the present application.
Detailed Description
The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Therefore, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature.
First, some technical terms involved in the embodiments of the present application are introduced.
Data preparation refers to the process of preprocessing raw data to make it more suitable for further processing and analysis. For example, raw data that has gone through data preparation can be fed to a machine learning (ML) algorithm as better-conditioned known data, improving the effectiveness of the algorithm.
Data preparation includes multiple steps such as data collection, data preview, data cleaning, and data labeling. Data preview is one of the important features of data preparation. It refers to the process of presenting the user with a small portion of sample data (also called preview data) drawn from a large amount of original data, so that the user can quickly understand the distribution of the original data. The user can choose the next data processing or analysis operation based on the preview data. For example, a user who finds many duplicate entries in the preview data can submit a task to deduplicate the original data. As another example, a user who finds many entries with missing values in the preview data can submit a task to fill in the missing values of the original data.
Data preview is usually implemented through data sampling. Data sampling refers to the process of selecting a subset (also called a sample) from the whole data set of original data in order to estimate the characteristics of the whole. The data preview process may include importing original data, obtaining sample data, and previewing the data.
First, the user can import original data from multiple data sources. The data preview device then uses data sampling to obtain sample data from the original data and displays the sample data to the user. The sample data can be obtained in several ways. For example, referring to the schematic diagram of a method for obtaining sample data shown in FIG. 1, the original data includes multiple rows of data, and the data preview device may select the first n rows of the original data as the sample data. As another example, referring to the schematic diagram of a method for obtaining sample data shown in FIG. 2, the original data includes multiple rows of data, and the data preview device may randomly select n rows of the original data as the sample data.
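The two baseline approaches of FIG. 1 and FIG. 2 can be written in a few lines of Python. The function names `head_sample` and `random_sample` and the `seed` parameter are hypothetical and serve only to make the comparison concrete.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def head_sample(rows: Sequence[T], n: int) -> List[T]:
    """Baseline of FIG. 1: take the first n rows of the original data."""
    return list(rows[:n])


def random_sample(rows: Sequence[T], n: int, seed: Optional[int] = None) -> List[T]:
    """Baseline of FIG. 2: take n rows uniformly at random, without replacement."""
    return random.Random(seed).sample(list(rows), n)
```

If the original data happens to be sorted, `head_sample` returns only the smallest values, and for small n `random_sample` can easily miss rare attribute values.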
However, the sample data obtained by the above methods is not sufficiently representative and can hardly characterize the overall distribution of the original data. After the data preview device displays such sample data, it offers the user little help with subsequent data processing; that is, the user can hardly choose an appropriate data processing or analysis operation based on the sample data.
In addition, some scenarios other than data preview (for example, data analysis) also involve obtaining sample data through data sampling. Common data sampling techniques include filter-based sampling, anomaly-based sampling, and stratified sampling. For example, when the original data includes multiple rows with a "year" attribute, filter-based sampling can be used to obtain the rows whose year attribute value is "2022" as sample data. As another example, when the original data includes multiple rows with missing values, anomaly-based sampling can be used to obtain the rows with missing values as sample data.
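The filter-based and anomaly-based examples can be sketched as follows, assuming each row is a dictionary and a missing value is represented by `None`. The function names and the row representation are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


def filter_sample(rows: Sequence[Row], column: str, value: Any) -> List[Row]:
    """Filter-based sampling: keep rows whose attribute equals the given value."""
    return [row for row in rows if row.get(column) == value]


def missing_value_sample(rows: Sequence[Row]) -> List[Row]:
    """Anomaly-based sampling: keep rows that contain at least one missing value."""
    return [row for row in rows if any(v is None for v in row.values())]
```

Both functions return a subset chosen for a specific analysis goal, so the result is not intended to resemble the overall distribution of the original data.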
However, the above methods are mainly oriented to data analysis tasks, and the sampling method must be configured for the specific analysis task, so they do not fully match the needs of data preview. Moreover, in small-sample scenarios, it is difficult for the above methods to obtain sample data that is close to the global data distribution.
In view of this, an embodiment of the present application provides a data sampling method, which can be performed by a data sampling system. Specifically, the data sampling system can obtain a data set and determine the number of attribute columns and the data type of the attribute values in the data set. The data sampling system can then sample from the data set, according to the number of attribute columns and the data type of the attribute values, to obtain a sample set.
The method samples a large amount of raw data based on the number of attribute columns and the data type of the attribute values, and thereby obtains sample data that is close to the global data distribution. This improves the representativeness of the sample data and makes the sample data better suited to data preview scenarios, so that the user can more easily carry out subsequent data processing based on the sample data.
To make the technical solution of the present application clearer and easier to understand, the system architecture of the embodiments of the present application is described below with reference to the accompanying drawings.
Referring to the system architecture diagram of the data sampling system shown in FIG. 3, the data sampling system 30 is connected to at least one data source 10 and interacts with a user 20. In addition, the data sampling system 30 may also be connected to a data processing system 40, so as to provide the sampled data to the data processing system 40 for data processing such as data analysis.
In some embodiments, the data sampling system 30 may be a software system. The software system may be deployed in a computing device cluster, and the computing device cluster executes the data sampling method by running the program code of the software system. In other embodiments, the data sampling system 30 may be a hardware system, for example a computer cluster with a data sampling function, which implements the data sampling method of the embodiments of the present application when it runs. The embodiments of the present application are described by taking the data sampling system 30 as a software system.
Similarly, the data source 10 may be software or hardware that provides a data set. For example, the data source 10 may be a search engine, a database, or another application that generates a large amount of data. For another example, the data source 10 may be hardware that can generate or store a data set. The data processing system 40 may be a software system deployed in a computer cluster, and the computer cluster implements the data processing function by executing the program code of the software system. In some embodiments, the data processing system 40 may also be a hardware system that implements the data processing function when it runs.
Specifically, the data sampling system 30 includes a sample acquisition device 32 and a data preview device 34. Each device is described below.
The sample acquisition device 32 is configured to acquire a data set from at least one data source 10 and to sample from the data set to obtain a sample set. The sample acquisition device 32 can obtain the sample set from the data set according to the number of attribute columns in the data set and the data type of the attribute values.
The number of attribute columns may be 1 or greater than 1, and the data type of the attribute values may be discrete or continuous. When the data set has one attribute column and the attribute values are discrete, the sample acquisition device 32 can sample from the data set according to the proportions in which the different attribute values of that column occur in the data set, thereby obtaining the sample set. Similarly, when the data set has one attribute column and the attribute values are continuous, the sample acquisition device 32 can sort the attribute values of that column and select data at a target interval, thereby obtaining the sample set. When the data set has more than one attribute column, the sample acquisition device 32 can determine, according to the distance between the selected initial row sample data and the target row data, whether to replace part of the initial row sample data, thereby obtaining the sample set.
The data preview device 34 is configured to present the sample set to the user 20 so as to implement the data preview function. Specifically, the data preview device 34 can provide an interactive interface, also called a user interface (UI), through which it implements the data preview. The interactive interface may be a graphical user interface (GUI) or a command user interface (CUI).
The UI provided by the data preview device 34 may include a sample set display interface. Referring to the sample set display interface 400 shown in FIG. 4A to FIG. 4C, the data preview device 34 can use the sample set display interface 400 to present to the user 20 the sample set obtained by the sample acquisition device 32, so that the user 20 can understand the distribution of the data set based on the sample set.
Specifically, as shown in FIG. 4A, the left side of the sample set display interface 400 is a data source selection interface, which includes a data source search box 402 and a data source display interface 404. The user 20 can search for a target data source through the data source search box 402 and select from it the target data set to be previewed, or can click the target data source directly in the data source display interface 404 and determine the target data set in it, thereby completing the selection of the target data set. The right side of the sample set display interface 400 includes a data sampling method selection area 406 and a data preview area 408. The user 20 can select the data sampling method for the target data set through the data sampling method selection area 406 and preview the sample set of the target data set through the data preview area 408.
As shown in FIG. 4B, the user 20 selects data set 1 in data source 1 for data preview. At this time, the background of the area where data set 1 is located may become dark, indicating that this data set is the target data set. Further, the user 20 can click the drop-down mark in the data sampling method selection area 406 to open a drop-down menu 407 and select the data sampling method for data set 1. The data sampling methods may include random selection, selecting the first n rows, selecting by condition, and selecting representative samples, where selecting representative samples is the data sampling method provided in the embodiments of the present application. The user 20 determines the sampling method for data set 1 by clicking a data sampling method in the drop-down menu 407.
As shown in FIG. 4C, the user 20 selects "select representative samples" as the data sampling method for data set 1. The user 20 can then preview the sample set of data set 1 through the data preview area 408, where the sample set of data set 1 is obtained by sampling with the "select representative samples" method.
After the data sampling system 30 samples the data set and presents the sample set to the user 20, it can provide the data set to the data processing system 40, so that the data processing system 40 can perform subsequent processing on the data set according to the operation selected by the user 20. For example, when the data sampling system 30 is used for data preparation, the user 20 can determine, based on the sample set, the processing operations that the data processing system 40 performs on the data set, such as deduplication and missing value filling. For another example, when the data sampling system 30 is deployed in a data middle platform, the user 20 can select, based on the data preview result, the processing operations to be performed on the data set, so that the data processing system 40 can better manage the data. For yet another example, when the data sampling system 30 is used to select machine learning training data, the user 20 can understand the data distribution of the data set based on the sample set and thus select representative data for data annotation.
Next, the data sampling method of the embodiments of the present application is described in detail from the perspective of the data sampling system 30 with reference to the accompanying drawings.
Referring to the flowchart of the data sampling method shown in FIG. 5, the method includes:
S502: The data sampling system 30 acquires a data set.
A data set is a collection of data. In this embodiment, the data set can store data in the form of a table, where each column of the table represents an attribute and each row represents the data of one member.
Specifically, the data sampling system 30 can acquire the data set from at least one data source. The data source may include a search engine, a database, or another application that can provide a data set; the embodiments of the present application do not limit the type of the data source.
S504: The data sampling system 30 determines the number of attribute columns and the data type of the attribute values in the data set.
The columns in the data set may include an index column and attribute columns. The number of index columns may be 1. The index column works like a directory: the user can quickly find the corresponding content in the data set according to an index value of the index column. The index value of each row in the data set is unique, which guarantees the uniqueness of each row of data in the data set. For example, the index column may be a serial number or a user identifier. When the index column is a serial number, the index values may be 1, 2, ..., n. When the index column is a user identifier, the index values may be U01, U02, ..., Unn. Using the index column, the data sampling system 30 can retrieve and identify the data in the data set.
Attribute columns hold the attribute values of different attributes. For example, the attribute columns may include supplier, cloud service, city, and price. The number of attribute columns may be 1 or greater than 1: a number of 1 means the data set has only one attribute column, and a number greater than 1 means the data set has multiple attribute columns. The data type of the attribute values of an attribute column may be discrete or continuous. A discrete data type means that the attribute values are discrete data; for example, the attribute values of the supplier, cloud service, and city columns are discrete. A continuous data type means that the attribute values are continuous data that can take any value within the interval to which they belong; for example, the attribute values of an income column are continuous.
S506: The data sampling system 30 samples from the data set, according to the number of attribute columns and the data type of the attribute values in the data set, to obtain a sample set.
In some possible implementations, the data set contains a large amount of data. For example, the data set belongs to a big data application scenario and contains up to 100 million rows. In this case, if the data sampling system 30 loads the data set directly into memory and samples from it to obtain the sample set, the sampling takes a long time and the computational cost is high, which may cause memory overflow. Therefore, an embodiment of the present application provides a two-stage data sampling method.
The two-stage data sampling method is described in detail below with reference to the accompanying drawings.
Referring to the schematic flowchart of the two-stage data sampling method shown in FIG. 6, the data sampling system 30 can perform first-stage sampling and second-stage sampling on the data set. The data sampling system 30 first performs first-stage sampling on the data set to obtain an initial sample set, and then performs second-stage sampling on the initial sample set to obtain the sample set.
Specifically, the data sampling system 30 can use random sampling to obtain the initial sample set from the data set. The initial sample set may be larger than the sample set. For example, when the sample set is used for data preview, the sample set contains at most 1,000 rows because of the limits of the preview interface; in this case, the initial sample set may contain 100,000 rows.
Random sampling means sampling the data set according to a given ratio or sample size, so that the initial sample set is drawn from the data set with equal probability. Specifically, the data sampling system 30 can preset a sampling ratio and randomly sample the data set according to that ratio to obtain the initial sample set. The sampling ratio can be any real number between 0 and 1. For example, if the data sampling system 30 sets the sampling ratio to 0.03 and the data set contains 100 million rows, the initial sample set obtained by random sampling contains 3 million rows. In some embodiments, the data sampling system 30 can instead preset a sampling number, determine the sampling ratio from the sampling number, and then randomly sample the data set based on the sampling ratio to obtain the initial sample set. For example, if the data sampling system 30 sets the sampling number to 3 million and the data set contains 100 million rows, the data sampling system 30 determines from the ratio of the sampling number to the amount of data in the data set that the sampling ratio is 0.03, and then performs random sampling at that ratio to obtain an initial sample set of 3 million rows.
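The first-stage random sampling described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the application: the function name `first_stage_sample` and its parameters are hypothetical, the rows are assumed to be available as an in-memory list, and sampling is assumed to be without replacement.

```python
import random

def first_stage_sample(rows, ratio=None, size=None, seed=None):
    """Draw the initial sample set by uniform random sampling.

    Exactly one of `ratio` (a real number in [0, 1]) or `size` (a row count)
    must be given; a `size` is first converted to a ratio, as in the text.
    """
    if (ratio is None) == (size is None):
        raise ValueError("specify exactly one of ratio or size")
    n = len(rows)
    if n == 0:
        return []
    if ratio is None:
        ratio = size / n  # sampling number / amount of data in the data set
    if not 0 <= ratio <= 1:
        raise ValueError("sampling ratio must be between 0 and 1")
    k = round(ratio * n)  # number of rows in the initial sample set
    return random.Random(seed).sample(rows, k)

# Hypothetical data set of 10,000 rows.
data = list(range(10_000))
by_ratio = first_stage_sample(data, ratio=0.03, seed=1)
by_size = first_stage_sample(data, size=300, seed=1)
print(len(by_ratio), len(by_size))
```

With a ratio of 0.03, a data set of 100 million rows would yield 0.03 × 100,000,000 = 3,000,000 rows, which matches the figures in the example above.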
First-stage sampling effectively reduces the scale of the original data set, so that a device with limited computing resources can still perform the subsequent second-stage sampling, thereby enabling data preview in big data scenarios.
It should be noted that the embodiments of the present application use random sampling for first-stage sampling only as an example. In some possible implementations, the data sampling system 30 can use a different sampling method for first-stage sampling, which is not limited in the embodiments of the present application.
Furthermore, the first-stage sampling provided in the embodiments of the present application is optional. In some possible implementations, the data sampling system 30 can skip first-stage sampling and perform second-stage sampling directly on the data set, which is not limited in the embodiments of the present application.
After first-stage sampling, the data sampling system 30 can perform second-stage sampling on the initial sample set to obtain the sample set. Specifically, the data sampling system 30 can sample from the initial sample set according to the number of attribute columns and the data type of the attribute values to obtain the sample set.
The second-stage sampling process is described below in three cases, distinguished by the number of attribute columns and the data type of the attribute values.
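The three-way case split can be sketched as a small dispatch function. This is only an illustrative sketch: the function name and the strategy labels it returns are hypothetical names chosen here, not identifiers from the application.

```python
def choose_second_stage_strategy(num_attribute_columns, value_type=None):
    """Map (number of attribute columns, data type) to one of the three cases.

    `value_type` is "discrete" or "continuous"; it is only needed when there
    is a single attribute column. The returned labels are hypothetical.
    """
    if num_attribute_columns < 1:
        raise ValueError("the data set needs at least one attribute column")
    if num_attribute_columns > 1:
        return "case 3: greedy row replacement"
    if value_type == "discrete":
        return "case 1: proportional allocation"
    if value_type == "continuous":
        return "case 2: interval selection"
    raise ValueError("value_type must be 'discrete' or 'continuous'")

print(choose_second_stage_strategy(1, "discrete"))
print(choose_second_stage_strategy(1, "continuous"))
print(choose_second_stage_strategy(4))
```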
Case 1: discrete data in a single attribute column. When the number of attribute columns is 1 and the data type of the attribute values is discrete, the data sampling system 30 can obtain the sample set by sampling according to the proportions in which the different attribute values of the attribute column occur in the initial sample set.
In some possible implementations, the initial sample set includes n rows of data, the sample set includes m rows of data, and the sample set is a subset of the initial sample set. The data sampling system 30 can obtain a first count, namely the number of times the i-th attribute value, among the different attribute values of the attribute column, occurs in the initial sample set. It then determines the number of times the i-th attribute value occurs in the sample set from the product of the first count and the ratio of the size of the sample set to the size of the initial sample set.
When the product is an integer, the data sampling system 30 can take that integer as the number of times the i-th attribute value occurs in the sample set. When the product is not an integer, the data sampling system 30 can take either the product rounded up or the product rounded down as the number of times the i-th attribute value occurs in the sample set, choosing between them according to the difference between the distances from the sample set to the initial sample set under the two roundings.
For example, denote the first count of the i-th attribute value in the initial sample set as $n_i$ and the number of times the i-th attribute value occurs in the sample set as $m_i$. For the i-th attribute value, the product described above can be expressed as $m n_i / n$, where $m/n$ is the ratio of the size of the sample set to the size of the initial sample set.
When $m n_i / n$ is an integer, $m_i = m n_i / n$. When $m n_i / n$ is not an integer, the data sampling system 30 can determine the distance difference $\Delta_i$ between the sample set and the initial sample set under rounding the product up and under rounding it down:
[Formula (1): $\Delta_i$, given as a minuend computed with $\mathrm{ceil}(m n_i / n)$ minus a subtrahend computed with $\mathrm{floor}(m n_i / n)$]
Here, ceil(*) denotes the round-up (ceiling) function and floor(*) denotes the round-down (floor) function.
The minuend represents the distance between the distribution of the i-th attribute value in the sample set and its distribution in the initial sample set when the count is rounded up (that is, $m_i = \mathrm{ceil}(m n_i / n)$). The smaller the minuend, the closer the distribution of the i-th attribute value in the sample set after rounding up is to its distribution in the initial sample set.
The subtrahend represents the distance between the distribution of the i-th attribute value in the sample set and its distribution in the initial sample set when the count is rounded down (that is, $m_i = \mathrm{floor}(m n_i / n)$). The smaller the subtrahend, the closer the distribution of the i-th attribute value in the sample set after rounding down is to its distribution in the initial sample set.
Based on this, the data sampling system 30 can sort the values of $\Delta_i$ to determine each $m_i$. For example, the data sampling system 30 can sort the $\Delta_i$ in ascending order, select the first $m - \sum_{i=1}^{K} \mathrm{floor}(m n_i / n)$ of them and set $m_i = \mathrm{ceil}(m n_i / n)$ for the corresponding attribute values, and set $m_i = \mathrm{floor}(m n_i / n)$ for the remaining ones, so that the counts $m_i$ sum to m. Here, K denotes the number of distinct attribute values that occur in the initial sample set. For example, if the attribute column in the initial sample set is "city" and its attribute values are "Beijing" and "Hangzhou", then K = 2.
It can be understood that the smaller $\Delta_i$ is, the closer the distribution of the i-th attribute value in the sample set after rounding up is to its distribution in the initial sample set; that is, rounding up the count of the i-th attribute value yields a sample set that better reflects the distribution of the initial sample set. Therefore, the $m_i$ corresponding to the first $m - \sum_{i=1}^{K} \mathrm{floor}(m n_i / n)$ values of $\Delta_i$ are rounded up and the $m_i$ corresponding to the remaining $\Delta_i$ are rounded down, so that the resulting sample set better represents the data distribution of the initial sample set.
In the embodiments of the present application, discrete data in a single attribute column is sampled according to the proportions in which the attribute values occur in the initial sample set, so that the sample set can represent the distribution of the different attribute values in the initial sample set. When the computed number of occurrences of an attribute value in the sample set is not an integer, the distances of the sample sets obtained by rounding up and by rounding down are evaluated, drawing on the idea of KL divergence (Kullback-Leibler divergence), to determine the number of occurrences of that attribute value in the sample set. This reduces the difference between the data distribution of the sample set and that of the initial sample set.
The above method is illustrated below with an example.
Table 1 shows an initial sample set of discrete data in a single attribute column. The initial sample set has two columns: the first is the index column "User ID", with index values "U01" to "U09", and the second is the attribute column "Location", with attribute values "Beijing" and "Hangzhou". As can be seen, Table 1 has one attribute column, and the data type of its attribute values is discrete.
Table 1
Data sampling is performed on the data in Table 1 with the number of rows in the sample set given as 3, that is, m = 3. The initial sample set has n = 9 rows; the first attribute value "Beijing" occurs $n_1 = 3$ times in the initial sample set, and the second attribute value "Hangzhou" occurs $n_2 = 6$ times. It follows that "Beijing" occurs $m_1 = m n_1 / n = 1$ time in the sample set and "Hangzhou" occurs $m_2 = m n_2 / n = 2$ times. Therefore, the sample set obtained for the attribute column "Location" after second-stage sampling is {"Beijing", "Hangzhou", "Hangzhou"}.
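The allocation for Case 1 can be sketched as follows. The sketch rests on a loudly stated assumption: the distance used in formula (1) is not available in this text, so the code substitutes the absolute difference between the proportion of a value in the sample set and its proportion in the initial sample set. With that substituted distance the procedure coincides with the largest-remainder method; it is not claimed to be the distance used by the application. The function name `allocate_counts` is hypothetical.

```python
from collections import Counter

def allocate_counts(values, m):
    """Return {attribute value: m_i} for a sample set of m rows.

    ASSUMPTION: the distance between distributions is |m_i/m - n_i/n|,
    chosen here for illustration only.
    """
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")
    counts = Counter(values)  # n_i for each distinct attribute value
    floors = {v: (m * c) // n for v, c in counts.items()}  # floor(m * n_i / n)

    def delta(v):
        p = counts[v] / n
        up = abs((floors[v] + 1) / m - p)   # distance if m_i is rounded up
        down = abs(floors[v] / m - p)       # distance if m_i is rounded down
        return up - down

    # Only values whose product m * n_i / n is not an integer need a rounding decision.
    fractional = [v for v, c in counts.items() if (m * c) % n != 0]
    extra = m - sum(floors.values())        # how many counts must be rounded up
    alloc = dict(floors)
    for v in sorted(fractional, key=delta)[:extra]:  # smallest delta first
        alloc[v] += 1
    return alloc

# Attribute values of Table 1: "Beijing" three times, "Hangzhou" six times.
table1 = ["Beijing"] * 3 + ["Hangzhou"] * 6
alloc = allocate_counts(table1, 3)
sample = [v for v, c in alloc.items() for _ in range(c)]
print(alloc)   # {'Beijing': 1, 'Hangzhou': 2}
print(sample)  # ['Beijing', 'Hangzhou', 'Hangzhou']
```

For a case with non-integer products, for example the values A, B, C occurring 5, 3, and 2 times with m = 4, the products are 2, 1.2, and 0.8; one count must be rounded up, and under the assumed distance it is the count of C, giving 2, 1, and 1.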
Case 2: continuous data in a single attribute column. When the number of attribute columns is 1 and the data type of the attribute values is continuous, the data sampling system 30 can obtain the sample set by sampling at a target interval.
In some possible implementations, the initial sample set includes n rows of data, the sample set includes m rows of data, and the sample set is a subset of the initial sample set. The data sampling system 30 can sort the attribute values of the attribute column and select data from the sorted sequence at the target interval to obtain the sample set.
For example, the data sampling system 30 sorts the attribute values of the attribute column in ascending order and denotes the sorted sequence as $\{x_1, x_2, x_3, \ldots, x_n\}$. The sample set can then be expressed as:
$$\{\, x_{\mathrm{floor}((j + 0.5)\, n / m) + 1} \mid j = 0, 1, \ldots, m - 1 \,\} \quad (2)$$
where $n/m$ denotes the target interval.
In the embodiments of the present application, for continuous data in a single attribute column, the sample set is selected at the target interval from the sorted sequence of attribute values, so that the sample set reflects the data distribution of the initial sample set and the data in the sample set is representative.
The above method is illustrated below with an example.
Table 2 shows an initial sample set of continuous data in a single attribute column. The initial sample set has two columns: the first is the index column "User ID", with index values "U01" to "U09", and the second is the attribute column "Income", with attribute values {200, 500, 100, 800, 600, 700, 300, 900, 400}. As can be seen, Table 2 has one attribute column, and the data type of its attribute values is continuous.
Table 2
Data sampling is performed on the data in Table 2 with the number of rows in the sample set given as 3, that is, m = 3. The initial sample set has n = 9 rows, and sorting the attribute values in Table 2 in ascending order gives the sequence {100, 200, 300, 400, 500, 600, 700, 800, 900}. According to formula (2), the position of the first selected sample is floor((0+0.5)n/m)+1 = 2, the position of the second selected sample is floor((1+0.5)n/m)+1 = 5, and the position of the third selected sample is floor((2+0.5)n/m)+1 = 8. Therefore, the sample set obtained for the attribute column "Income" after second-stage sampling is {200, 500, 800}.
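The position rule used in this worked example can be sketched as follows. This is a minimal sketch: the function name `interval_sample` is hypothetical, and the 1-based position floor((j+0.5)n/m)+1 is rewritten as the equivalent 0-based index ((2j+1)·n) // (2m) so that only integer arithmetic is used.

```python
def interval_sample(values, m):
    """Select m values from the sorted sequence at an interval of n/m.

    The 1-based position floor((j + 0.5) * n / m) + 1 for j = 0..m-1 equals
    the 0-based index ((2 * j + 1) * n) // (2 * m).
    """
    xs = sorted(values)
    n = len(xs)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= n")
    return [xs[((2 * j + 1) * n) // (2 * m)] for j in range(m)]

# Attribute values of the "Income" column in Table 2.
income = [200, 500, 100, 800, 600, 700, 300, 900, 400]
print(interval_sample(income, 3))  # [200, 500, 800]
```

Each selected value sits at the middle of one of m equal-width segments of the sorted sequence, which is why the positions 2, 5, and 8 are evenly spaced by n/m = 3.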
Case 3: data with multiple attribute columns. When the number of attribute columns is greater than 1, the data sampling system 30 may obtain the sample set by sampling based on a greedy algorithm.
In some possible implementations, the initial sample set includes n rows of data, the sample set includes m rows of data, and the sample set is a subset of the initial sample set. The data sampling system 30 can select a first column sample set for each attribute column according to the data type of the attribute values in that column, and then select m rows of data from the initial sample set to obtain initial row sample data. The attribute values of each attribute column in the initial row sample data form the second column sample set of that attribute column. For target row data outside the m rows, the data sampling system 30 determines the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of the initial row sample data, and decides according to this distance whether to perform the replacement on the initial row sample data, thereby obtaining the sample set.
For example, when the data type of the attribute values in an attribute column is discrete, the data sampling system 30 can select the first column sample set of that column using the method described in Case 1. When the data type is continuous, the data sampling system 30 can select the first column sample set of that column using the method described in Case 2. Further, the data sampling system 30 can select the first m rows of the initial sample set as the initial row sample data.
Specifically, when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column after replacement is greater than the sum of those distances over the attribute columns before replacement, the data sampling system 30 may refuse to replace a row of the initial row sample data with the target row.
A larger sum of distances after replacement indicates that the sample set obtained by the replacement would not reflect the data distribution of the initial sample set any better, so the replacement is not performed.
Further, for the i-th column among the at least one attribute column, the distance between the first column sample set and the second column sample set when the target row replaces a row of the initial row sample data can be determined by a method that depends on the data type of the attribute values in the i-th column.
When the attribute values of the i-th column are discrete, the data sampling system 30 can determine, for each distinct attribute value of the i-th column, the difference between its number of occurrences in the first column sample set of the i-th column and its number of occurrences in the second column sample set after replacement, and then determine the distance from the sum of the absolute values of these differences.
When the attribute values of the i-th column are continuous, the data sampling system 30 can sort the first column sample set of the i-th column and the second column sample set after replacement in the same way, determine the difference between each element of the sorted first column sample set and the corresponding element of the sorted second column sample set, and then determine the distance from the sum of the absolute values of these differences.
For example, suppose the initial sample set has d columns of data. The data sampling system 30 can select the first column sample set of each attribute column using the method of Case 1 or Case 2, according to the data type of the attribute values in that column. Next, the data sampling system 30 takes the first m rows of the initial sample set as the initial row sample data, which yields the corresponding second column sample sets; the second column sample set of the i-th column is denoted $S_i$.
For the target rows outside the initial row sample data (that is, the rows of the initial sample set other than the first m rows), the data sampling system 30 can traverse the target rows and determine, for each one, the distance between the first column sample set and the second column sample set when the target row replaces a row of the initial row sample data.
Specifically, when the data type of the attribute values in the i-th column is discrete, let K be the number of distinct values of that column. For the j-th value, the number of times it appears in the first column sample set of the i-th column is compared with $m_j$, the number of times it appears in the second column sample set $S_i$. In this case, the distance is determined from the sum, over the K values, of the absolute values of the differences between these two counts.
When the data type of the attribute values in the i-th column is continuous, the first column sample set of the i-th column and the second column sample set $S_i$ are sorted in the same way, for example in ascending order, which gives two sequences of m elements; the sorted $S_i$ is written as $\{y_1, y_2, \ldots, y_m\}$. In this case, the distance is determined from the sum of the absolute values of the differences between corresponding elements of the two sorted sequences.
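The two distance rules described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the distances are computed as plain, unnormalized sums of absolute differences, so any scaling applied in the application's own formulas is not reproduced here.

```python
from collections import Counter


def discrete_distance(first_set, second_set):
    """Sum over all distinct values of |count in first set - count in second set|."""
    c1, c2 = Counter(first_set), Counter(second_set)
    # A Counter returns 0 for a value it has not seen, so missing values are handled.
    return sum(abs(c1[v] - c2[v]) for v in set(c1) | set(c2))


def continuous_distance(first_set, second_set):
    """Sort both sets the same way and sum |x_j - y_j| over corresponding elements."""
    if len(first_set) != len(second_set):
        raise ValueError("both column sample sets must contain m elements")
    return sum(abs(x - y) for x, y in zip(sorted(first_set), sorted(second_set)))


# Hypothetical first column sample set for a discrete column.
print(discrete_distance(["Beijing", "Hangzhou", "Hangzhou"],
                        ["Beijing", "Hangzhou", "Beijing"]))   # 2
print(continuous_distance([200, 500, 800], [200, 500, 100]))   # 700
```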
In this embodiment of the present application, drawing on the idea of the Kolmogorov-Smirnov (KS) statistic, the distance before and after a target row replaces an initial row is evaluated for attribute values of different data types across multiple attribute columns. A smaller distance after replacement indicates that the data distribution of the sample set after replacement is closer to the data distribution of the initial sample set, in which case the replacement is performed. The sample set is finally determined by traversing the rows of the initial sample set. In this way, the final sample set closely reflects the data distribution of the initial sample set, which improves the representativeness of the sampled data.
In some embodiments, for the target row under consideration, the data sampling system 30 can determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces each of multiple rows (for example, every row) of the initial row sample. When several candidate replacements yield a sum of distances smaller than the sum of distances before replacement, the row whose replacement reduces the sum of distances the most is selected. For example, suppose the initial row sample includes two rows. If replacing the first row with the target row reduces the sum of distances by 0.5 and replacing the second row reduces it by 0.7, the target row replaces the second row. Selecting, for each replacement operation, the row that gives the largest reduction in distance improves the replacement result.
In some embodiments, for the target row under consideration, the data sampling system 30 can instead evaluate the rows of the initial row sample in a certain order (for example, from top to bottom), determining the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces each row in turn, and perform the replacement as soon as the sum of distances is smaller than the sum of distances before replacement. For example, suppose the initial row sample includes two rows. Once the data sampling system 30 determines that replacing the first row with the target row reduces the sum of distances by 0.5, the target row directly replaces the first row, and the reduction obtained by replacing the second row is not calculated. Selecting, for each replacement operation, the first row whose replacement reduces the distance improves the replacement speed.
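The two replacement strategies above (largest reduction versus first reduction) can be sketched as a single greedy loop. This is a minimal illustration, not the implementation of the application: all function names are hypothetical, distances are unnormalized sums of absolute differences, and the "Location" values of rows U04 to U09 as well as the first column sample set for "Location" are made-up placeholders.

```python
from collections import Counter


def column_distance(target, current, discrete):
    """Distance between a first column sample set and a second column sample set."""
    if discrete:
        c1, c2 = Counter(target), Counter(current)
        return sum(abs(c1[v] - c2[v]) for v in set(c1) | set(c2))
    return sum(abs(x - y) for x, y in zip(sorted(target), sorted(current)))


def total_distance(targets, rows, discrete_flags):
    """Sum of the per-column distances for the candidate row sample `rows`."""
    return sum(
        column_distance(targets[i], [row[i] for row in rows], discrete_flags[i])
        for i in range(len(targets))
    )


def greedy_sample(data, targets, discrete_flags, m, best_improvement=True):
    """Start from the first m rows and accept only replacements that reduce the distance."""
    sample = list(data[:m])
    current = total_distance(targets, sample, discrete_flags)
    for candidate in data[m:]:
        best_pos, best_dist = None, current
        for pos in range(m):
            trial = sample[:pos] + [candidate] + sample[pos + 1:]
            dist = total_distance(targets, trial, discrete_flags)
            if dist < best_dist:
                best_pos, best_dist = pos, dist
                if not best_improvement:
                    break  # first-improvement variant: stop at the first reduction
        if best_pos is not None:
            sample[best_pos] = candidate
            current = best_dist
    return sample


# Rows are (location, income); locations after the third row are hypothetical.
data = [("Beijing", 200), ("Hangzhou", 500), ("Beijing", 100), ("Hangzhou", 800),
        ("Beijing", 600), ("Hangzhou", 700), ("Beijing", 300), ("Hangzhou", 900),
        ("Beijing", 400)]
targets = [["Beijing", "Hangzhou", "Hangzhou"], [200, 500, 800]]  # hypothetical first sets
print(greedy_sample(data, targets, [True, False], m=3))
```

The best-improvement variant evaluates all m positions for every target row, while the first-improvement variant trades some quality of the result for fewer distance evaluations.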
In some embodiments, when the data sampling system 30 evaluates whether a target row should replace a row of the initial row sample, rows of the initial row sample that have already been replaced may be left out of consideration. For example, suppose the initial row sample includes two rows and there are two target rows, referred to as the first target row and the second target row, and the data sampling system 30 has already replaced the first row of the initial row sample with the first target row. When determining the distance for the second target row, the data sampling system 30 may skip the sum of distances that would result from replacing the first target row (that is, the first row of the initial row sample after replacement) with the second target row, and evaluate only the sum of distances after the second target row replaces the second row of the initial row sample. It then decides whether to replace the second row, which reduces the amount of calculation.
Further, the embodiments of the present application may provide a pruning strategy to improve replacement efficiency. While the data sampling system 30 is determining the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column after replacement by the target row, if not all attribute columns have been calculated but the partial sum is already greater than the sum of distances before replacement, the data sampling system 30 may stop the calculation and reject the replacement of that row of the initial row sample data by the target row.
Further, the embodiments of the present application may provide incremental calculation to reduce the amount of calculation. Because a replacement may not change the attribute values of every attribute column, the data sampling system 30 may recalculate the distance between the first column sample set and the second column sample set only for the attribute columns whose attribute values change. For the attribute columns whose attribute values do not change, the data sampling system 30 may reuse the distance calculated before replacement.
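Pruning and incremental calculation can be combined in one evaluation routine, sketched below. This is a hedged illustration: the function `trial_total` and its parameters are hypothetical, and it assumes per-column distances are non-negative, so a partial sum that already exceeds the pre-replacement total can never recover.

```python
from typing import Callable, Optional, Sequence


def trial_total(
    cached: Sequence[float],
    changed: Sequence[bool],
    recompute: Callable[[int], float],
    limit: float,
) -> Optional[float]:
    """Sum of per-column distances for a trial replacement.

    cached[i]    distance of column i before the replacement
    changed[i]   True if the replacement changes the values of column i
    recompute(i) computes the distance of column i after the replacement
    limit        sum of distances before the replacement
    Returns None as soon as the partial sum exceeds `limit` (pruning).
    """
    total = 0.0
    for i, old in enumerate(cached):
        # Incremental calculation: reuse the cached distance for unchanged columns.
        total += recompute(i) if changed[i] else old
        if total > limit:
            return None  # pruned: the replacement is rejected without finishing
    return total


# Column 0 is unchanged (distance 2 reused); column 1 drops from 700 to 0.
print(trial_total([2.0, 700.0], [False, True], lambda i: 0.0, limit=702.0))  # 2.0
```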
The method is illustrated below with an example.
Table 3 shows an initial sample set with multiple attribute columns. The initial sample set includes three columns. The first column is the index column "User ID", with index values "U01" to "U09". The second column is the attribute column "Location", with attribute values "Beijing" and "Hangzhou". The third column is the attribute column "Income", with attribute values {200, 500, 100, 800, 600, 700, 300, 900, 400}. The number of attribute columns in Table 3 is therefore 2, and the data types of the attribute values include both discrete and continuous.
Table 3
Data sampling is performed on the data in Table 3 with the number of data rows in the sample set set to 3, that is, m = 3.
First, the first column sample set of each attribute column is determined using the sampling method for discrete data of a single attribute column given in Case 1 and the sampling method for continuous data of a single attribute column given in Case 2. Next, the first three rows of Table 3 are selected as the initial row sample data, as shown in Table 4.
Table 4
User ID    Location    Income
U01        Beijing     200
U02        Hangzhou    500
U03        Beijing     100
Next, taking the fourth row of the initial sample set (that is, the data row whose "User ID" is "U04") as the target row, the sum of the distances between the first column sample sets and the second column sample sets is determined before and after replacement.
For the initial row sample data, the second column sample sets are $S_1$ = {"Beijing", "Hangzhou", "Beijing"} and $S_2$ = {200, 500, 100}. For the "Location" column, the numbers of times "Hangzhou" and "Beijing" appear in the first column sample set are compared with the numbers of times they appear in $S_1$ (once and twice, respectively), which gives the distance for that column. Further, sorting $S_2$ gives $y_1 = 100$, $y_2 = 200$, $y_3 = 500$, which gives the distance for the "Income" column. The sum of the distances before replacement is thus 0.721.
Similarly, when the target row replaces the third row of the initial row sample data (that is, the data row whose "User ID" is "U03"), the sum of the distances after replacement is 0.
When the target row replaces the third row of the initial row sample data, the sum of the distances decreases by 0.721, so the data sampling system 30 can perform this replacement. In this example, performing the replacement yields the sample set shown in Table 5.
Table 5
In this way, the data sampling system 30 can perform data sampling with the sampling method that matches the number of attribute columns and the data type of the attribute values, thereby obtaining the sample set.
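The choice among the three cases can be summarized by a small dispatch function. This is only a sketch; the function name and the returned strategy labels are hypothetical and are not identifiers used by the application.

```python
def choose_sampling_strategy(num_attribute_columns: int, is_discrete: bool = False) -> str:
    """Map the number of attribute columns and the data type to one of the three cases."""
    if num_attribute_columns < 1:
        raise ValueError("at least one attribute column is required")
    if num_attribute_columns > 1:
        return "greedy_multi_column"        # Case 3: greedy row replacement
    if is_discrete:
        return "proportional_discrete"      # Case 1: sample by value proportions
    return "interval_continuous"            # Case 2: sort, then pick at a target interval


print(choose_sampling_strategy(1, is_discrete=False))  # interval_continuous
```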
Further, the data sampling system 30 may present the sample set to the user for data preview, or perform data analysis on the sample set using an artificial intelligence (AI) algorithm. For example, the data sampling system 30 may send the sample set to the data processing system 40, so that data processing operations such as data analysis are performed in the data processing system 40.
The method samples data based on the number of attribute columns and the data type of the attribute values in a large amount of original data, and thereby obtains sample data whose distribution is close to the global data distribution. This improves the representativeness of the sample data, makes the sample data better suited to data preview scenarios, and makes it easier for users to carry out subsequent data processing based on the sample data.
Based on the data sampling method provided in the embodiments of the present application, the embodiments of the present application further provide the data sampling system 30 described above. The data sampling system 30 is introduced below with reference to the accompanying drawings.
Referring to the schematic structural diagram of the data sampling system 30 shown in FIG. 7, the system 30 includes:
an acquisition module 302, configured to acquire a data set;
a determination module 304, configured to determine the number of attribute columns and the data type of the attribute values in the data set;
a sampling module 306, configured to obtain a sample set by sampling from the data set according to the number of attribute columns and the data type of the attribute values.
The acquisition module 302, the determination module 304 and the sampling module 306 may be modules in the sample acquisition device 32.
It should be noted that the above division of modules is only one possible implementation provided by the embodiments of the present application. In other possible implementations, the modules may be divided differently as needed, which is not limited by the embodiments of the present application.
The acquisition module 302, the determination module 304 and the sampling module 306 may each be implemented as a hardware module or as a software module, for example by a computing device or by a computing engine on a computing device. The acquisition module 302 is taken as an example below.
When implemented in software, the acquisition module 302 may be an application or an application module, such as a computing engine, running on a computing device or a computing device cluster.
When implemented in hardware, the acquisition module 302 may include at least one computing device, such as a server. Alternatively, the acquisition module 302 may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
In some possible implementations, the system 30 further includes:
an interaction module, configured to present the sample set to the user; or
a data analysis module, configured to perform data analysis on the sample set using an artificial intelligence (AI) algorithm.
In some possible implementations, the sampling module 306 is specifically configured to:
when the number of attribute columns is 1 and the data type of the attribute values is discrete, obtain the sample set by sampling according to the proportions in which the attribute values of the attribute column appear in the data set.
In some possible implementations, the data set includes n rows of data, the sample set includes m rows of data, the sample set is a subset of the data set, and the sampling module 306 is specifically configured to:
obtain a first count, namely the number of times the i-th attribute value among the multiple attribute values appears in the data set;
determine the number of times the i-th attribute value appears in the sample set based on the product of the first count and the ratio m/n of the sample set size to the data set size.
In some possible implementations, the sampling module 306 is specifically configured to:
when the product is an integer, determine that integer as the number of times the i-th attribute value appears in the sample set;
when the product is not an integer, determine either the product rounded up or the product rounded down as the number of times the i-th attribute value appears in the sample set, according to the difference between the distances from the sample set to the data set obtained with the rounded-up and the rounded-down values.
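The proportional allocation above can be sketched as follows. The product m·count/n is computed exactly with `fractions.Fraction`. For the non-integer case, this sketch swaps in the largest-remainder rule (round down everywhere, then round up the values with the largest fractional parts until m rows are allocated) as a simple stand-in; it is not the distance-based rounding rule of the application, and the function name `allocate_counts` is hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def allocate_counts(values, m):
    """Decide how many times each distinct value should appear in an m-row sample."""
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("expected 0 < m <= n")
    # Exact product of the first count and the ratio m/n for every distinct value.
    quotas = {v: Fraction(c * m, n) for v, c in Counter(values).items()}
    # Integer products are kept as they are; non-integer products are rounded down first.
    alloc = {v: math.floor(q) for v, q in quotas.items()}
    remaining = m - sum(alloc.values())
    # Stand-in rule (assumption): round up the values with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda v: quotas[v] - alloc[v], reverse=True)
    for v in by_remainder[:remaining]:
        alloc[v] += 1
    return alloc


print(allocate_counts(["A"] * 6 + ["B"] * 3, 3))  # {'A': 2, 'B': 1}
print(allocate_counts(["A"] * 5 + ["B"] * 4, 3))  # {'A': 2, 'B': 1}
```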
In some possible implementations, the sampling module 306 is specifically configured to:
when the number of attribute columns is 1 and the data type of the attribute values is continuous, sort the attribute values of the attribute column;
select data from the sorted sequence at a target interval to obtain the sample set.
In some possible implementations, the sampling module 306 is specifically configured to:
when the number of attribute columns is greater than 1, select a first column sample set for each attribute column according to the data type of the attribute values in that column;
select m rows of data from the data set to obtain initial row sample data, where the attribute values of each attribute column in the initial row sample data form the second column sample set of that attribute column;
for target row data outside the m rows of data, determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces a row of the initial row sample data;
determine, according to the distance, whether to perform the replacement on the initial row sample data, to obtain the sample set.
In some possible implementations, the sampling module 306 is specifically configured to:
when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column is greater than the sum of the distances between the first column sample set and the second column sample set of the attribute columns before replacement, refuse to replace a row of the initial row sample data with the target row.
In some possible implementations, the sampling module 306 is specifically configured to:
when the at least one attribute column includes an i-th column whose attribute values are discrete, determine, for each attribute value of the i-th column, the difference between its number of occurrences in the first column sample set of the i-th column and its number of occurrences in the second column sample set after replacement;
determine, from the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
In some possible implementations, the sampling module 306 is specifically configured to:
when the at least one attribute column includes an i-th column whose attribute values are continuous, sort the first column sample set of the i-th column and the second column sample set after replacement in the same way;
determine the difference between each element of the sorted first column sample set and the corresponding element of the sorted second column sample set;
determine, from the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of the at least one attribute column when the target row replaces a row of the initial row sample data.
The present application further provides a computing device 800. As shown in FIG. 8, the computing device 800 includes a bus 802, a processor 804, a memory 806 and a communication interface 808. The processor 804, the memory 806 and the communication interface 808 communicate with each other through the bus 802. The computing device 800 may be a server or a terminal device. It should be understood that the present application does not limit the number of processors or memories in the computing device 800.
The bus 802 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one line is shown in FIG. 8, but this does not mean that there is only one bus or only one type of bus. The bus 802 may include a path for transferring information between the components of the computing device 800 (for example, the memory 806, the processor 804 and the communication interface 808).
The processor 804 may include any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), a digital signal processor (DSP), and similar processors.
The memory 806 may include volatile memory, such as random access memory (RAM). The memory 806 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD) or a solid state drive (SSD). The memory 806 stores executable program code, and the processor 804 executes the executable program code to implement the aforementioned data sampling method. Specifically, the memory 806 stores the instructions with which the data sampling system 30 executes the data sampling method.
The communication interface 808 uses a transceiver module, such as but not limited to a network interface card or a transceiver, to implement communication between the computing device 800 and other devices or communication networks.
本申请实施例还提供了一种计算设备集群。该计算设备集群包括至少一台计算设备。该计算设备可以是服务器,例如是中心服务器、边缘服务器,或者是本地数据中心中的本地服务器。在一些实施例中,计算设备也可以是台式机、笔记本电脑或者智能手机等终端设备。The embodiment of the present application also provides a computing device cluster. The computing device cluster includes at least one computing device. The computing device can be a server, such as a central server, an edge server, or a local server in a local data center. In some embodiments, the computing device can also be a terminal device such as a desktop computer, a laptop computer, or a smart phone.
As shown in FIG. 9, the computing device cluster includes at least one computing device 800. The memories 806 of one or more computing devices 800 in the computing device cluster may store the same instructions that the data sampling system 30 uses to execute the data sampling method.
In some possible implementations, one or more computing devices 800 in the computing device cluster may each execute only part of the instructions that the data sampling system 30 uses to execute the data sampling method. In other words, a combination of one or more computing devices 800 may jointly execute those instructions.
It should be noted that the memories 806 of different computing devices 800 in the computing device cluster may store different instructions, each used to perform part of the functions of the data sampling system 30.
FIG. 10 shows one possible implementation. As shown in FIG. 10, two computing devices 800A and 800B are connected through the communication interface 808. The memory of computing device 800A stores instructions for performing the functions of the acquisition module 302 and the determination module 304. The memory of computing device 800B stores instructions for performing the functions of the sampling module 306. In other words, the memories 806 of computing devices 800A and 800B together store the instructions that the data sampling system 30 uses to execute the data sampling method.
The connection between the computing devices shown in FIG. 10 may reflect the fact that the data sampling method provided in this application needs to determine information about the attribute columns and attribute values in the data set before sampling. The functions of the acquisition module 302 and the determination module 304 are therefore assigned to computing device 800A, and the functions of the sampling module 306 are assigned to computing device 800B.
It should be understood that the functions of computing device 800A shown in FIG. 10 may also be performed by multiple computing devices 800. Likewise, the functions of computing device 800B may also be performed by multiple computing devices 800.
In some possible implementations, one or more computing devices in the computing device cluster may be connected through a network. The network may be a wide area network, a local area network, or the like. FIG. 11 shows one possible implementation. As shown in FIG. 11, two computing devices 800C and 800D are connected through a network. Specifically, each computing device connects to the network through its communication interface. In this type of implementation, the memory 806 of computing device 800C stores instructions for performing the functions of the acquisition module 302 and the determination module 304, and the memory 806 of computing device 800D stores instructions for performing the functions of the sampling module 306.
The connection between the computing devices shown in FIG. 11 may reflect the fact that the data sampling method provided in this application needs to determine information about the attribute columns and attribute values in the data set before sampling. The functions of the acquisition module 302 and the determination module 304 are therefore assigned to computing device 800C, and the functions of the sampling module 306 are assigned to computing device 800D. It should be understood that the functions of computing device 800C shown in FIG. 11 may also be performed by multiple computing devices 800. Likewise, the functions of computing device 800D may also be performed by multiple computing devices 800.
An embodiment of the present application also provides a computer-readable storage medium. The computer-readable storage medium may be any available medium that a computing device can use for storage, or a data storage device, such as a data center, that contains one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state drive). The computer-readable storage medium includes instructions that instruct a computing device to execute the above data sampling method as applied to the data sampling system.
An embodiment of the present application also provides a computer program product containing instructions. The computer program product may be software or a program product that contains instructions and can run on a computing device or be stored in any available medium. When the computer program product runs on at least one computing device, it causes the at least one computing device to execute the above data sampling method.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the protection scope of the technical solutions of the embodiments of the present invention.

Claims (23)

  1. A data sampling method, characterized in that the method comprises:
    obtaining a data set;
    determining the number of attribute columns in the data set and the data type of the attribute values;
    sampling from the data set, according to the number of attribute columns and the data type of the attribute values, to obtain a sample set.
  2. The method according to claim 1, characterized in that the method further comprises:
    presenting the sample set to a user; or,
    performing data analysis based on the sample set by means of an artificial intelligence (AI) algorithm.
  3. The method according to claim 1, characterized in that, when the number of attribute columns is 1 and the data type of the attribute values is discrete, the sampling from the data set to obtain a sample set according to the number of attribute columns and the data type of the attribute values comprises:
    sampling to obtain the sample set according to the proportions in which the multiple attribute values of the attribute column occur in the data set.
  4. The method according to claim 3, characterized in that the data set includes n rows of data, the sample set includes m rows of data, the sample set is a subset of the data set, and the sampling to obtain sample data according to the proportions in which the multiple attribute values of the attribute column occur in the data set comprises:
    obtaining a first count, being the number of times the i-th attribute value among the multiple attribute values occurs in the data set;
    determining the number of times the i-th attribute value occurs in the sample set according to the product of the first count and the ratio between the sizes of the data set and the sample set.
  5. The method according to claim 4, characterized in that the determining of the number of times the i-th attribute value occurs in the sample set according to the product of the first count and the ratio between the sizes of the data set and the sample set comprises:
    when the product is an integer, determining that integer as the number of times the i-th attribute value occurs in the sample set;
    when the product is not an integer, determining either the product rounded up or the product rounded down as the number of times the i-th attribute value occurs in the sample set, according to the difference between the distances from the sample set to the data set obtained with the product rounded up and with the product rounded down.
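The allocation described in claims 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `allocate_counts` is hypothetical, the ratio is taken as m/n (sample size over data set size), and the distance-based choice between rounding up and rounding down is replaced here by the largest-remainder rule, which is a swapped-in technique.

```python
from collections import Counter


def allocate_counts(values, m):
    """Return how often each distinct value should occur in an m-row sample.

    Assumption: the per-value target is (m / n) * count, and fractional
    targets are resolved with the largest-remainder rule rather than the
    distance comparison named in claim 5.
    """
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(values)")
    counts = Counter(values)
    # Integer arithmetic avoids floating-point error: floor and remainder
    # of (m * count) / n for every distinct value.
    alloc = {v: (m * c) // n for v, c in counts.items()}
    remainders = {v: (m * c) % n for v, c in counts.items()}
    # Rows still unassigned after rounding every target down.
    leftover = m - sum(alloc.values())
    # Round up the values whose targets had the largest fractional parts.
    for v in sorted(counts, key=lambda v: remainders[v], reverse=True)[:leftover]:
        alloc[v] += 1
    return alloc


if __name__ == "__main__":
    column = ["a"] * 6 + ["b"] * 3 + ["c"]
    print(allocate_counts(column, 5))
```

When every product is an integer, no rounding is needed and the allocation reproduces the proportions of the data set exactly.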
  6. The method according to claim 1, characterized in that, when the number of attribute columns is 1 and the data type of the attribute values is continuous, the sampling from the data set to obtain a sample set according to the number of attribute columns and the data type of the attribute values comprises:
    sorting the multiple attribute values of the attribute column;
    selecting data from the sorted sequence at a target interval to obtain the sample set.
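The sort-and-select step of claim 6 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `interval_sample` is hypothetical, and taking the target interval as n/m with one value picked from the middle of each interval is an assumption, since the claim does not fix how the interval is chosen.

```python
def interval_sample(values, m):
    """Pick m values from a continuous column at a regular interval.

    Assumption: the sorted sequence is split into m equal-width index
    buckets and the value at the middle of each bucket is selected.
    """
    n = len(values)
    if not 0 < m <= n:
        raise ValueError("m must satisfy 0 < m <= len(values)")
    ordered = sorted(values)
    # Index (2i + 1) * n // (2m) is the midpoint of the i-th bucket;
    # consecutive midpoints are n/m >= 1 apart, so indices never repeat.
    return [ordered[(2 * i + 1) * n // (2 * m)] for i in range(m)]


if __name__ == "__main__":
    print(interval_sample([7, 2, 9, 0, 5, 3, 8, 1, 6, 4], 5))
```

Selecting at a regular interval over the sorted values keeps the sample spread across the whole value range rather than clustered where a random draw happens to land.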
  7. The method according to claim 1, characterized in that, when the number of attribute columns is greater than 1, the sampling from the data set to obtain a sample set according to the number of attribute columns and the data type of the attribute values comprises:
    selecting a first column sample set for each attribute column according to the data type of the attribute values of each of the multiple attribute columns;
    selecting m rows of data from the data set to obtain initial row sample data, wherein the attribute values of each attribute column in the initial row sample data form a second column sample set for that attribute column;
    for target row data outside the m rows of data, determining the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data;
    determining, according to the distance, whether to perform the replacement operation on the initial row sample data, to obtain the sample set.
  8. The method according to claim 7, characterized in that the determining, according to the distance, of whether to perform the replacement operation on the initial row sample data comprises:
    when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column is greater than the sum of the distances between the first column sample set and the second column sample set of each attribute column before the replacement, refusing to perform the operation of replacing one row of the initial row sample data with the target row.
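The replacement procedure of claims 7 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `refine_rows` and `sorted_l1` are hypothetical, the per-column distance is a placeholder supplied by the caller, every position of the current sample is tried for each target row, and a swap is accepted only when the total distance strictly decreases. The claims leave these details open; claim 8 only requires rejecting a swap that increases the total.

```python
def sorted_l1(target, candidate):
    """Placeholder column distance: sum of absolute differences after sorting."""
    return sum(abs(a - b) for a, b in zip(sorted(target), sorted(candidate)))


def refine_rows(rows, initial_idx, column_targets, column_distances):
    """Greedily swap rows into the sample while the total distance drops.

    rows: list of tuples (the data set).
    initial_idx: indices of the m rows forming the initial row sample.
    column_targets[j]: ideal per-column sample (first column sample set).
    column_distances[j]: function(target, candidate) -> number.
    """
    chosen = list(initial_idx)

    def total(idx):
        # Sum over columns of the distance between the ideal column sample
        # and the column values of the candidate row sample.
        return sum(
            dist(column_targets[j], [rows[i][j] for i in idx])
            for j, dist in enumerate(column_distances)
        )

    current = total(chosen)
    in_sample = set(chosen)
    for t in range(len(rows)):
        if t in in_sample:
            continue
        best_pos, best_total = None, current
        for pos in range(len(chosen)):
            trial = chosen[:pos] + [t] + chosen[pos + 1:]
            trial_total = total(trial)
            if trial_total < best_total:
                best_pos, best_total = pos, trial_total
        # Reject the swap unless it strictly reduces the total distance.
        if best_pos is not None:
            in_sample.discard(chosen[best_pos])
            in_sample.add(t)
            chosen[best_pos] = t
            current = best_total
    return [rows[i] for i in chosen], current


if __name__ == "__main__":
    data = [(1, 10), (2, 20), (9, 90), (5, 50)]
    sample, dist = refine_rows(
        data, [0, 1], [[1, 9], [10, 90]], [sorted_l1, sorted_l1]
    )
    print(sample, dist)
```

Because a swap is applied only when it lowers the total, the total distance never increases from one step to the next, so the loop cannot make the initial row sample worse.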
  9. The method according to claim 7 or 8, characterized in that the at least one attribute column includes an i-th column, and when the attribute values of the i-th column are discrete, the determining of the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data comprises:
    determining, for each attribute value of the i-th column, the difference between its numbers of occurrences in the first column sample set of the i-th column and in the second column sample set after the replacement;
    determining, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data.
  10. The method according to claim 7 or 8, characterized in that the at least one attribute column includes an i-th column, and when the attribute values of the i-th column are continuous, the determining of the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data comprises:
    sorting the first column sample set of the i-th column and the second column sample set after the replacement in the same manner;
    determining the differences between the elements of the sorted first column sample set and the corresponding elements of the sorted second column sample set;
    determining, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data.
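The two column distances of claims 9 and 10 can be sketched as follows. This is a minimal illustration with hypothetical function names (`discrete_distance`, `continuous_distance`); it assumes both column sample sets are plain Python sequences and, for the continuous case, that they have the same length.

```python
from collections import Counter


def discrete_distance(first, second):
    """Sum over all values of |occurrences in first - occurrences in second|."""
    c1, c2 = Counter(first), Counter(second)
    # A Counter returns 0 for a value it has never seen.
    return sum(abs(c1[v] - c2[v]) for v in c1.keys() | c2.keys())


def continuous_distance(first, second):
    """Sum of |a - b| over element pairs after sorting both sets the same way."""
    if len(first) != len(second):
        raise ValueError("both column sample sets must have the same length")
    return sum(abs(a - b) for a, b in zip(sorted(first), sorted(second)))


if __name__ == "__main__":
    print(discrete_distance(["a", "a", "b"], ["a", "b", "b"]))
    print(continuous_distance([1.0, 5.0, 3.0], [2.0, 3.0, 7.0]))
```

Both distances are zero exactly when the two column sample sets contain the same values with the same multiplicities, which is the condition under which the row sample matches the per-column ideal.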
  11. A data sampling system, characterized in that the system comprises:
    an acquisition module, configured to obtain a data set;
    a determination module, configured to determine the number of attribute columns in the data set and the data type of the attribute values;
    a sampling module, configured to sample from the data set, according to the number of attribute columns and the data type of the attribute values, to obtain a sample set.
  12. The system according to claim 11, characterized in that the system further comprises:
    an interaction module, configured to present the sample set to a user; or,
    a data analysis module, configured to perform data analysis based on the sample set by means of an artificial intelligence (AI) algorithm.
  13. The system according to claim 11, characterized in that the sampling module is specifically configured to:
    when the number of attribute columns is 1 and the data type of the attribute values is discrete, sample to obtain the sample set according to the proportions in which the multiple attribute values of the attribute column occur in the data set.
  14. The system according to claim 13, characterized in that the data set includes n rows of data, the sample set includes m rows of data, the sample set is a subset of the data set, and the sampling module is specifically configured to:
    obtain a first count, being the number of times the i-th attribute value among the multiple attribute values occurs in the data set;
    determine the number of times the i-th attribute value occurs in the sample set according to the product of the first count and the ratio between the sizes of the data set and the sample set.
  15. The system according to claim 14, characterized in that the sampling module is specifically configured to:
    when the product is an integer, determine that integer as the number of times the i-th attribute value occurs in the sample set;
    when the product is not an integer, determine either the product rounded up or the product rounded down as the number of times the i-th attribute value occurs in the sample set, according to the difference between the distances from the sample set to the data set obtained with the product rounded up and with the product rounded down.
  16. The system according to claim 11, characterized in that the sampling module is specifically configured to:
    when the number of attribute columns is 1 and the data type of the attribute values is continuous, sort the multiple attribute values of the attribute column;
    select data from the sorted sequence at a target interval to obtain the sample set.
  17. The system according to claim 11, characterized in that the sampling module is specifically configured to:
    when the number of attribute columns is greater than 1, select a first column sample set for each attribute column according to the data type of the attribute values of each of the multiple attribute columns;
    select m rows of data from the data set to obtain initial row sample data, wherein the attribute values of each attribute column in the initial row sample data form a second column sample set for that attribute column;
    for target row data outside the m rows of data, determine the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data;
    determine, according to the distance, whether to perform the replacement operation on the initial row sample data, to obtain the sample set.
  18. The system according to claim 17, characterized in that the sampling module is specifically configured to:
    when the sum of the distances between the first column sample set and the second column sample set of the at least one attribute column is greater than the sum of the distances between the first column sample set and the second column sample set of each attribute column before the replacement, refuse to perform the operation of replacing one row of the initial row sample data with the target row.
  19. The system according to claim 17 or 18, characterized in that the sampling module is specifically configured to:
    when the at least one attribute column includes an i-th column and the attribute values of the i-th column are discrete, determine, for each attribute value of the i-th column, the difference between its numbers of occurrences in the first column sample set of the i-th column and in the second column sample set after the replacement;
    determine, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data.
  20. The system according to claim 17 or 18, characterized in that the sampling module is specifically configured to:
    when the at least one attribute column includes an i-th column and the attribute values of the i-th column are continuous, sort the first column sample set of the i-th column and the second column sample set after the replacement in the same manner;
    determine the differences between the elements of the sorted first column sample set and the corresponding elements of the sorted second column sample set;
    determine, according to the sum of the absolute values of the differences, the distance between the first column sample set and the second column sample set of at least one attribute column when the target row replaces one row of the initial row sample data.
  21. A computing device cluster, characterized in that the computing device cluster includes at least one computing device, the at least one computing device includes at least one processor and at least one memory, and the at least one memory stores computer-readable instructions; the at least one processor executes the computer-readable instructions so that the computing device cluster performs the method according to any one of claims 1 to 10.
  22. A computer-readable storage medium, characterized in that it includes computer-readable instructions; the computer-readable instructions are used to implement the method according to any one of claims 1 to 10.
  23. A computer program product, characterized in that it includes computer-readable instructions; the computer-readable instructions are used to implement the method according to any one of claims 1 to 10.
PCT/CN2023/100937 2022-11-03 2023-06-19 Data sampling method and related device WO2024093253A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211372127.7A CN118035505A (en) 2022-11-03 2022-11-03 Data sampling method and related equipment
CN202211372127.7 2022-11-03

Publications (1)

Publication Number Publication Date
WO2024093253A1 true WO2024093253A1 (en) 2024-05-10

Family

ID=90929555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/100937 WO2024093253A1 (en) 2022-11-03 2023-06-19 Data sampling method and related device

Country Status (2)

Country Link
CN (1) CN118035505A (en)
WO (1) WO2024093253A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130226940A1 (en) * 2012-02-28 2013-08-29 International Business Machines Corporation Generating Composite Key Relationships Between Database Objects Based on Sampling
US8812947B1 (en) * 2011-12-08 2014-08-19 Google Inc. Ranking graphical visualizations of a data set according to data attributes
US20170116227A1 (en) * 2015-10-23 2017-04-27 Oracle International Corporation System and method for extracting a star schema from tabular data for use in a multidimensional database environment
CN111813800A (en) * 2020-09-03 2020-10-23 国网浙江省电力有限公司营销服务中心 Streaming data real-time approximate calculation method based on deep reinforcement learning
CN113496119A (en) * 2020-03-20 2021-10-12 北京庖丁科技有限公司 Method, electronic device and computer readable medium for extracting tuple data in table

Also Published As

Publication number Publication date
CN118035505A (en) 2024-05-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23884192

Country of ref document: EP

Kind code of ref document: A1