CN111447278A - Distributed system for acquiring continuous features and method thereof

Info

Publication number: CN111447278A
Application number: CN202010229130.8A
Authority: CN
Inventors: 罗远飞; 焦英翔; 郑淇木; 石光川
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2020-07-24
Anticipated expiration: 2040-03-27
Also published as: CN111447278B

Abstract

A distributed system for acquiring continuous features and a method thereof are provided. The distributed system includes a plurality of node devices configured to perform, in parallel, continuization processing on at least one discrete field in the data samples of a designated data set, so as to obtain a continuous feature corresponding to each discrete field. The node devices acquire the data samples to be processed in each round according to the ordering of the data samples in the designated data set, with data samples earlier in the ordering being acquired by the node devices earlier. For each discrete field, each node device obtains the feature value of the continuous feature corresponding to the field value of that discrete field in the data samples to be processed in the current round, based on a historical statistical result about the discrete field acquired from a server in that round, and sends the field value to the server so that the server updates the historical statistical result about the discrete field based on the field value.

Description

Distributed system for acquiring continuous features and method thereof

Technical Field

The present invention relates generally to the field of data processing, and more particularly, to a distributed system for acquiring continuous features and a method thereof.

Background

During training, a decision tree model (for example, GBDT) selects a value of a feature as a cut point, and the relative size of a sample's feature value and the cut point determines how the sample is classified. This property makes decision tree models better suited to continuous features, because the values of a discrete feature usually have no meaningful size relationship. For example, a decision tree model can split on "age of user", placing ages less than 18 and ages greater than or equal to 18 in different classes, but "address of user" cannot be handled with similar logic.

To solve this problem, one-hot encoding is usually applied to discrete values. For example, the address of the user can be split into two branches of the decision tree, "located in the lake region" and "not located in the lake region". The disadvantage is that the number of features becomes very large, which causes various problems in model training.

Disclosure of Invention

An exemplary embodiment of the present invention provides a distributed system for acquiring continuous features and a method thereof, which can reduce the number of continuous features obtained from continuization processing and can increase the speed of that processing.

According to an exemplary embodiment of the present invention, a distributed system for acquiring continuous features is provided. The distributed system comprises a plurality of node devices configured to perform, in parallel, continuization processing on at least one discrete field in the data samples of a designated data set, so as to obtain a continuous feature corresponding to each discrete field. The plurality of node devices acquire the data samples to be processed in each round according to the ordering of the data samples in the designated data set, data samples earlier in the ordering are acquired by the node devices earlier, and the data samples acquired by different node devices are different. For each discrete field, each node device obtains the feature value of the continuous feature corresponding to the field value of the discrete field in the data samples to be processed in the current round, based on the historical statistical result about the discrete field acquired from the server in that round, and sends the field value to the server so that the server updates the historical statistical result about the discrete field based on the field value.

Optionally, the distributed system further comprises a data order-preserving device configured to provide the data samples in the designated data set to the plurality of node devices in streaming form according to the ordering, so as to provide the plurality of node devices with the data samples to be processed in each round.

Optionally, when the data samples in the designated data set have a time sequence, the data samples are ordered chronologically, and the data order-preserving device provides the plurality of node devices with the data samples to be processed in each round in chronological order, so that data samples earlier in time are provided to the node devices for processing earlier.

Optionally, when the data samples in the designated data set do not have a time sequence, the data order-preserving device randomly reorders the data samples in the designated data set N times and, for each of the original ordering and the N random orderings, provides the plurality of node devices with the data samples to be processed in each round according to that ordering, so that the node devices obtain, for that ordering, a continuous feature corresponding to each discrete field, where N is an integer greater than or equal to 0.

Optionally, the distributed system further comprises a server configured to maintain the historical statistical result about each discrete field requiring continuization processing.

Optionally, the data sample includes a field value of the at least one discrete field and a label value, where the historical statistical result about each discrete field requiring continuization processing is a historical statistical result about the different field values of the discrete field and their corresponding label values.

Optionally, each node device obtains the feature value of the continuous feature corresponding to the field value of a discrete field in the data sample to be processed in the current round and sends the data sample to the server. On receiving a data sample from a node device, the server deletes, from its stored data samples, the earliest-stored data sample that meets a preset condition, stores the newly received data sample, and calculates the historical statistical result about each discrete field requiring continuization processing based on the updated set of stored data samples.

Optionally, each node device obtains the feature value of the continuous feature corresponding to the field value of a discrete field in the data sample to be processed in the current round and sends the data sample to the server. On receiving a data sample from a node device, the server applies decay processing to the current historical statistical result about each discrete field requiring continuization processing, and then updates the decayed historical statistical result based on the newly received data sample.
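As a non-limiting illustration of the two optional update modes described above, the following Python sketch shows one way a server might maintain a windowed statistic and a decayed statistic. The class names `WindowStats` and `DecayStats`, the use of the positive-sample ratio as the statistic, the decay factor, and the interpretation of the "preset condition" as "the window is full" are all assumptions made for this example and are not specified by the present disclosure.

```python
from collections import defaultdict, deque


class WindowStats:
    """Hypothetical window mode: keep only the most recent samples."""

    def __init__(self, max_samples):
        self.max_samples = max_samples
        self.window = deque()  # stored (field_value, label) pairs, oldest first

    def update(self, field_value, label):
        # Assumed preset condition: the window is full, so drop the
        # earliest-stored sample before storing the new one.
        if len(self.window) >= self.max_samples:
            self.window.popleft()
        self.window.append((field_value, label))

    def positive_ratio(self, field_value):
        labels = [lab for val, lab in self.window if val == field_value]
        return sum(labels) / len(labels) if labels else 0.0


class DecayStats:
    """Hypothetical decay mode: shrink old counts before adding new ones."""

    def __init__(self, decay):
        self.decay = decay
        self.pos = defaultdict(float)    # decayed positive counts
        self.total = defaultdict(float)  # decayed total counts

    def update(self, field_value, label):
        # Decay every existing count, then add the newly received sample.
        for key in list(self.total):
            self.pos[key] *= self.decay
            self.total[key] *= self.decay
        self.pos[field_value] += label
        self.total[field_value] += 1.0

    def positive_ratio(self, field_value):
        total = self.total.get(field_value, 0.0)
        return self.pos.get(field_value, 0.0) / total if total else 0.0


window = WindowStats(max_samples=2)
decayed = DecayStats(decay=0.5)
for value, label in [("a", 1), ("a", 0), ("a", 0)]:
    window.update(value, label)
    decayed.update(value, label)
```

In both modes older samples contribute less to the statistic than newer ones, which is one way the data-timeliness requirement might be addressed.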

Optionally, the data order-preserving device divides the data samples of the designated data set into a plurality of shards for storage, and randomly reorders the data samples in the designated data set N times by randomly reordering the plurality of shards N times.

Optionally, the data order-preserving device is further configured to save the results of the N random orderings of the plurality of shards, so that the process of acquiring continuous features can be reproduced later.

Optionally, the data order-preserving device comprises a plurality of order-preserving clients in one-to-one correspondence with the plurality of node devices, and each node device is integrated with its corresponding order-preserving client. Each order-preserving client is configured to provide a part of the data samples in the designated data set to the corresponding node device in streaming form according to the ordering, so as to provide that node device with the data samples to be processed in each round, and different order-preserving clients correspond to different data samples.

According to another exemplary embodiment of the present invention, a method for acquiring continuous features is provided. The method comprises: performing, by a plurality of node devices in parallel, continuization processing on at least one discrete field in the data samples of a designated data set to obtain a continuous feature corresponding to each discrete field, wherein the plurality of node devices process the data samples according to the ordering of the data samples in the designated data set, data samples earlier in the ordering are processed by the node devices earlier, and the data samples processed by different node devices are different. The step of performing continuization processing in parallel comprises, for each node device: acquiring, by the node device, the data samples to be processed in the current round; and, for each discrete field requiring continuization processing, obtaining, by the node device, the feature value of the continuous feature corresponding to the field value of the discrete field in the data samples to be processed in the current round, based on the historical statistical result about the discrete field acquired from the server in the current round, and sending the field value to the server so that the server updates the historical statistical result about the discrete field based on the field value.

Optionally, the method further comprises: providing, by a data order-preserving device, the data samples in the designated data set to the plurality of node devices in streaming form according to the ordering, so as to provide the plurality of node devices with the data samples to be processed in each round.

Optionally, when the data samples in the designated data set have a time sequence, the data samples are ordered chronologically, and the step of the data order-preserving device providing the data samples to the plurality of node devices in streaming form according to the ordering comprises: providing, by the data order-preserving device, the plurality of node devices with the data samples to be processed in each round in chronological order, wherein data samples earlier in time are provided to the node devices for processing earlier.

Optionally, when the data samples in the designated data set do not have a time sequence, the step of the data order-preserving device providing the data samples to the plurality of node devices in streaming form according to the ordering comprises: randomly reordering, by the data order-preserving device, the data samples in the designated data set N times and, for each of the original ordering and the N random orderings, providing the plurality of node devices with the data samples to be processed in each round according to that ordering, so that the node devices obtain, for that ordering, a continuous feature corresponding to each discrete field, where N is an integer greater than or equal to 0.

Optionally, the method further comprises: maintaining, by the server, the historical statistical result about each discrete field requiring continuization processing.

Optionally, each node device obtains the feature value of the continuous feature corresponding to the field value of a discrete field in the data sample to be processed in the current round and sends the data sample to the server, and the step of the server maintaining the historical statistical result about each discrete field requiring continuization processing comprises: on receiving a data sample from a node device, deleting, by the server, from its stored data samples the earliest-stored data sample that meets a preset condition, storing the newly received data sample, and calculating the historical statistical result about each discrete field requiring continuization processing based on the updated set of stored data samples.

Optionally, each node device obtains the feature value of the continuous feature corresponding to the field value of a discrete field in the data sample to be processed in the current round and sends the data sample to the server, and the step of the server maintaining the historical statistical result about each discrete field requiring continuization processing comprises: on receiving a data sample from a node device, applying, by the server, decay processing to the current historical statistical result about each discrete field requiring continuization processing, and updating the decayed historical statistical result based on the newly received data sample.

Optionally, the method further comprises: dividing, by the data order-preserving device, the data samples of the designated data set into a plurality of shards for storage, wherein the step of the data order-preserving device randomly reordering the data samples in the designated data set N times comprises: randomly reordering, by the data order-preserving device, the plurality of shards N times.

Optionally, the method further comprises: saving, by the data order-preserving device, the results of the N random orderings of the plurality of shards, so that the process of acquiring continuous features can be reproduced later.

Optionally, the data order-preserving device includes a plurality of order-preserving clients in one-to-one correspondence with the plurality of node devices, and each node device is integrated with its corresponding order-preserving client, wherein the step of the data order-preserving device providing the data samples in the designated data set to the plurality of node devices in streaming form according to the ordering comprises: for each order-preserving client, providing, by the order-preserving client, a part of the data samples in the designated data set to the corresponding node device in streaming form according to the ordering, so as to provide that node device with the data samples to be processed in each round, wherein different order-preserving clients correspond to different data samples.

According to the distributed system and method for acquiring continuous features of the exemplary embodiments of the present invention, continuization processing of discrete fields can be performed in a distributed scenario to acquire continuous features, and the efficiency of the processing is improved. With this general framework, continuous features can be acquired for both time-ordered and non-time-ordered samples; the correctness of the continuous features acquired for time-ordered samples can be ensured to a certain extent, and the precision of the continuous features acquired for non-time-ordered samples is improved. In addition, the user only needs to specify, using single-threaded logic, how the historical statistical result of the discrete fields requiring continuization processing is obtained (for example, the statistical method, a decay operation, a window operation, and the like), and continuous features that meet the user's requirements are provided automatically. This meets the user's individual requirements (for example, requirements on data timeliness) and lowers the development threshold.

Additional aspects and/or advantages of the present general inventive concept will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the general inventive concept.

Drawings

The above and other objects and features of exemplary embodiments of the present invention will become more apparent from the following description taken in conjunction with the accompanying drawings which illustrate exemplary embodiments, wherein:

FIG. 1 illustrates a block diagram of a distributed system for obtaining continuous features in accordance with an exemplary embodiment of the present invention;

FIG. 2 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the present invention;

FIG. 3 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the invention;

FIG. 4 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the present invention;

FIGS. 5 and 6 illustrate an example of an operational flow of a distributed system for obtaining continuous features according to an exemplary embodiment of the present invention;

FIG. 7 shows a flowchart of a method for obtaining continuous features according to an exemplary embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. The embodiments are described below with reference to the figures in order to explain the present invention.

FIG. 1 shows a block diagram of a distributed system for obtaining continuous features according to an exemplary embodiment of the invention. As shown in FIG. 1, a distributed system for acquiring continuous features according to an exemplary embodiment of the present invention includes a plurality of node devices 1000 (e.g., 1000-1, 1000-2, …, 1000-n, where n is an integer greater than 1).

The plurality of node devices 1000 are configured to perform, in parallel, continuization processing on at least one discrete field in the data samples of a designated data set to obtain a continuous feature corresponding to each discrete field. The plurality of node devices 1000 acquire the data samples to be processed in each round according to the ordering of the data samples in the designated data set, data samples earlier in the ordering are acquired by the node devices 1000 for processing earlier, and the data samples acquired by different node devices 1000 are different. For each discrete field requiring continuization processing, each node device 1000 obtains the feature value of the continuous feature corresponding to the field value of the discrete field in the data samples to be processed in the current round, based on the historical statistical result about the discrete field acquired from a server in that round, and sends the field value to the server so that the server updates the historical statistical result about the discrete field based on the field value.

Specifically, in each round, each node device 1000 acquires the data samples it needs to process in that round and obtains from the server the latest historical statistical result about each discrete field requiring continuization processing. Based on the obtained historical statistical result, the node device 1000 performs continuization processing on the field values of those discrete fields in the acquired data samples to obtain the feature values of the corresponding continuous features, and sends the field values processed in the current round to the server so that the server updates the historical statistical result about those discrete fields. The node device 1000 then proceeds to the next round. This ensures that the historical statistical result used by each node device 1000 during continuization processing is based on the data previously processed by all node devices 1000. It should be understood that all node devices 1000 may enter the next round together after all have completed the current round, or each node device 1000 may enter the next round immediately after completing its current round, so that node devices 1000 with greater processing capability process more data samples. It should also be understood that each round refers to processing a small portion (batch) of the data samples in the designated data set, rather than processing all of the data samples in the designated data set in one pass.
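As a non-limiting illustration, the per-round behavior of a single node device described above (pull the latest statistics, convert the field values of the current batch, push the field values back) might be sketched in Python as follows. The names `StatsServer`, `pull`, `push`, and `process_round` are hypothetical, and the choice of historical frequency as the statistic is an assumption made only for this example.

```python
from collections import defaultdict


class StatsServer:
    """Hypothetical in-memory stand-in for the server holding the statistics."""

    def __init__(self):
        self.counts = defaultdict(int)  # field value -> times seen so far

    def pull(self):
        # Return a snapshot of the latest historical statistics.
        return dict(self.counts)

    def push(self, field_values):
        # Update the historical statistics with newly processed field values.
        for value in field_values:
            self.counts[value] += 1


def process_round(server, batch):
    """One round on one node: pull statistics, convert the batch, push values."""
    stats = server.pull()
    total = sum(stats.values())
    # Assumed statistic: historical frequency of each field value.
    features = [stats.get(v, 0) / total if total else 0.0 for v in batch]
    server.push(batch)
    return features


server = StatsServer()
round1 = process_round(server, ["a", "b", "a"])  # no history yet
round2 = process_round(server, ["a", "c"])       # uses the history of round 1
```

Because the statistics are pulled before the batch is pushed, the feature values of a batch depend only on field values processed in earlier rounds.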

Here, it should be understood that, for each of at least one ordering of the data samples of the designated data set, the plurality of node devices 1000 may perform continuization processing on a discrete field requiring such processing, so as to obtain one continuous feature of that discrete field for each ordering. In other words, for each discrete field requiring continuization processing, one or more corresponding continuous features may be obtained, where each continuous feature is based on one ordering of the data samples in the designated data set and different continuous features correspond to different orderings. As an example, when the data samples in the designated data set can be ordered chronologically, one continuous feature corresponding to the chronological ordering may be obtained for each discrete field requiring continuization processing; when the data samples cannot be ordered chronologically because they lack a time sequence, a plurality of continuous features may be obtained for each such discrete field. As an example, the data samples in the designated data set may be ordered chronologically (i.e., the earlier in time, the earlier in the ordering) or in another manner (e.g., randomly, or in the initial default order).

In the process of obtaining the continuous features corresponding to one ordering of the data samples in the designated data set, the plurality of node devices 1000 may acquire the data samples in sequence according to that ordering as the data samples to be processed in each round. Data samples earlier in the ordering are acquired by the node devices 1000 earlier, the data samples acquired by different node devices 1000 are different, and a data sample that has been acquired by a node device 1000 is not acquired again. That is, for the same discrete field requiring continuization processing, each node device 1000 processes a part of the data samples in the designated data set, the data samples processed by different node devices 1000 do not overlap, and the union of the data samples processed by all node devices 1000 is exactly the designated data set.

For example, assume that a distributed system for acquiring continuous features according to an exemplary embodiment of the present invention includes 3 node devices 1000-1, 1000-2, and 1000-3. When the node devices 1000-1, 1000-2, and 1000-3 enter the first round, according to the current ordering, the node device 1000-1 acquires the 1st to 10th data samples for processing, the node device 1000-2 acquires the 11th to 20th data samples for processing, and the node device 1000-3 acquires the 21st to 30th data samples for processing. When the node devices enter the second round, the node device 1000-1 acquires the 31st to 40th data samples, the node device 1000-2 acquires the 41st to 50th data samples, and the node device 1000-3 acquires the 51st to 60th data samples, and so on; multiple rounds of processing are performed until all the data samples in the designated data set have been processed. It should be understood that each node device 1000 may acquire one data sample or a data block containing multiple data samples per round, and that the number of data samples acquired by different node devices 1000 per round may be the same or different.
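As a non-limiting illustration, the assignment in the example above (3 node devices, 10 data samples per node device per round) might be sketched in Python as follows. The function name `assign_batches` and the fixed, equal batch size are assumptions made for this example; indices are 0-based, so `range(0, 10)` corresponds to the 1st to 10th data samples.

```python
def assign_batches(num_samples, num_nodes, batch_size):
    """Yield (round_index, node_index, sample_indices) following the ordering."""
    position = 0
    round_index = 0
    while position < num_samples:
        for node_index in range(num_nodes):
            if position >= num_samples:
                break
            end = min(position + batch_size, num_samples)
            # Each batch is handed out once; batches of different nodes never overlap.
            yield round_index, node_index, range(position, end)
            position = end
        round_index += 1


# 60 samples, 3 node devices, 10 samples per node device per round.
schedule = list(assign_batches(num_samples=60, num_nodes=3, batch_size=10))
for round_index, node_index, indices in schedule:
    print(f"round {round_index + 1}, node 1000-{node_index + 1}: "
          f"samples {indices[0] + 1} to {indices[-1] + 1}")
```

The sketch hands out batches in lockstep; a variant in which faster node devices simply request the next batch as soon as they finish would preserve the same non-overlap property.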

As an example, the historical statistical result about a discrete field requiring continuization processing may be a statistical result over the historical field values of the discrete field, that is, over the field values of the discrete field that have already undergone continuization processing. For example, the data sample may include a field value of the at least one discrete field and a label value (i.e., the field value of the prediction target field), and the historical statistical result about the discrete field may be a historical statistical result about the different field values of the discrete field and their corresponding label values, for example, the historical positive-sample ratio of each distinct field value of the discrete field.
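As a non-limiting illustration of the positive-sample-ratio example above, the following Python sketch converts each field value into the ratio of positive labels observed so far for that field value. The class name `PositiveRatioStats`, the default value returned for unseen field values, and the sample data are assumptions made for this example.

```python
from collections import defaultdict


class PositiveRatioStats:
    """Hypothetical statistic: historical positive-sample ratio per field value."""

    def __init__(self, default=0.0):
        self.pos = defaultdict(int)    # field value -> positive labels seen
        self.total = defaultdict(int)  # field value -> samples seen
        self.default = default         # assumed value for unseen field values

    def feature_value(self, field_value):
        total = self.total.get(field_value, 0)
        if total == 0:
            return self.default
        return self.pos.get(field_value, 0) / total

    def update(self, field_value, label):
        self.pos[field_value] += label
        self.total[field_value] += 1


# (field value of a discrete "city" field, binary label), in processing order.
samples = [("beijing", 1), ("beijing", 0), ("shanghai", 1), ("beijing", 1)]

stats = PositiveRatioStats()
features = []
for city, label in samples:
    features.append(stats.feature_value(city))  # uses history only
    stats.update(city, label)                   # then the history is updated
```

Reading the statistic before updating it means that the label of a data sample never contributes to that sample's own feature value.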

As an example, the server may be a parameter server.

FIG. 2 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the invention. As shown in FIG. 2, a distributed system for acquiring continuous features according to another exemplary embodiment of the present invention includes a plurality of node devices 1000 and a data order-preserving device 2000.

The data order-preserving device 2000 is configured to provide the data samples in the designated data set to the plurality of node devices 1000 in streaming form according to the ordering, so as to provide the plurality of node devices 1000 with the data samples to be processed in each round.

Specifically, the data order-preserving device 2000 provides the node devices 1000 with the data samples they need to process in each round: each time a node device 1000 completes one round of processing, the data order-preserving device 2000 provides it with the data samples to be processed in the next round. In other words, the data order-preserving device 2000 controls the order in which the data samples in the designated data set are processed by the node devices 1000. Further, the data order-preserving device 2000 can order the data samples in the designated data set and provide the node devices 1000, in streaming form and according to that ordering, with the data samples to be processed in each round.

In the process of obtaining the continuous features corresponding to one ordering of the data samples in the designated data set, the data order-preserving device 2000 provides the plurality of node devices 1000, in streaming form and according to that ordering, with the data samples to be processed in each round. That is, the data samples in the designated data set are provided in sequence to the plurality of node devices 1000 for each round of computation: data samples earlier in the ordering are provided to the node devices 1000 for processing earlier, the data samples provided to different node devices 1000 are different, and a data sample that has already been provided is not provided again.

As an example, the data order-preserving device 2000 may determine whether the data samples in the designated data set have a time sequence and perform different processing depending on the result. For example, the data samples may be determined to have a time sequence when the designated data set has a time field. Whether the data samples in the designated data set have a time sequence may also be determined in other ways.

In one embodiment, when the data samples in the designated data set have a time sequence and are ordered chronologically, the data order-preserving device 2000 may provide the plurality of node devices 1000 with the data samples to be processed in each round in chronological order, wherein data samples earlier in the time sequence (i.e., with earlier corresponding times) are provided to the plurality of node devices 1000 for processing earlier, and the data samples processed by different node devices 1000 are different. For example, the data order-preserving device 2000 may sort the data samples in the designated data set chronologically, or the designated data set may already store the data samples in the order in which they arrived, so that no chronological sorting is needed. The data order-preserving device 2000 may then provide the data samples in sequence for each round of computation of the plurality of node devices 1000 according to the ordering of the data samples in the designated data set, thereby providing the plurality of node devices 1000 with the data samples to be processed in each round in chronological order. According to this exemplary embodiment, data samples are processed by the node devices 1000 in chronological order in a distributed scenario; that is, the sample data currently being processed by the node devices 1000 is sample data generated in the same time period. This improves the correctness of the historical statistical result used to calculate the feature values and prevents a look-ahead ("crossing") problem, in which the currently processed sample data uses information from future sample data.

In another embodiment, when the data samples in the designated data set do not have a time sequence, the data order-preserving device 2000 randomly reorders the data samples in the designated data set N times and, for each of the original ordering and the N random orderings, provides the plurality of node devices 1000 with the data samples to be processed in each round according to that ordering, so that the node devices 1000 obtain, for that ordering, a continuous feature corresponding to each discrete field, where N is an integer greater than or equal to 0. In other words, in this embodiment, N+1 corresponding continuous features are obtained for each discrete field requiring continuization processing, and the feature values of the different continuous features of a discrete field are obtained from different orderings of the data samples of the designated data set. For data samples without a time sequence (that is, when the discrete fields requiring continuization processing are non-time-ordered discrete fields), random ordering assigns an order to the data samples, so that a non-time-ordered discrete field is converted into a plurality of continuous features and the adaptability of the model to the discrete field is improved. Further, it should be understood that the server should maintain the historical statistical result about each discrete field requiring continuization processing separately for each ordering of the data samples in the designated data set.
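As a non-limiting illustration of obtaining N+1 continuous features from the original ordering and N random orderings, the following Python sketch keeps separate statistics for each ordering. The function names `convert_in_order` and `multi_order_features`, the positive-sample ratio as the statistic, the fixed random seed, and the sample data are assumptions made for this example.

```python
import random
from collections import defaultdict


def convert_in_order(samples, order):
    """Convert one discrete field following `order`, with statistics kept
    separately for this ordering (assumed statistic: positive-sample ratio)."""
    pos, total = defaultdict(int), defaultdict(int)
    features = [0.0] * len(samples)
    for idx in order:
        value, label = samples[idx]
        features[idx] = pos[value] / total[value] if total[value] else 0.0
        pos[value] += label
        total[value] += 1
    return features


def multi_order_features(samples, n_shuffles, seed=0):
    """Return, per sample, N+1 feature values: original order plus N shuffles."""
    rng = random.Random(seed)  # seeded so the orderings can be regenerated
    orders = [list(range(len(samples)))]
    for _ in range(n_shuffles):
        order = list(range(len(samples)))
        rng.shuffle(order)
        orders.append(order)
    per_order = [convert_in_order(samples, order) for order in orders]
    # Transpose so that each row holds the N+1 features of one sample.
    return [list(row) for row in zip(*per_order)]


samples = [("beijing", 1), ("beijing", 0), ("shanghai", 1), ("beijing", 1)]
rows = multi_order_features(samples, n_shuffles=2)
```

Each row of `rows` then holds three feature values for one data sample, the first of which comes from the original ordering.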

As an example, the data order-preserving device 2000 may divide the data samples of the designated data set into a plurality of shards for storage, and may randomly reorder the data samples of the designated data set N times by randomly reordering the plurality of shards N times. For example, if the designated data set contains 30000 data samples, the data samples may be divided into 3 shards for storage, with each shard storing 10000 data samples.

As an example, the data order-preserving apparatus 2000 may be further configured to save the results of the N random orderings, so that the process of acquiring the continuous features can be reproduced later. For example, when the data samples of the designated data set are divided into a plurality of shards for storage, the results of the N random orderings of the plurality of shards may be saved for later reproduction of the process of acquiring the continuous features. In this embodiment, a shuffle is performed at the granularity of shards rather than at the granularity of individual data samples. In this way, the order is disordered to a large extent at almost no additional overhead (only the order in which the plurality of shards are read needs to be changed), and it is also convenient for the data order-preserving apparatus 2000 to record each ordering of the data samples (only the reading order of the shards needs to be recorded) and to reproduce the feature processing result in the shuffle scenario.
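The shard-granularity shuffle and the saving of ordering results might be sketched as follows. This is only an illustrative sketch: the use of a seeded pseudo-random generator, the helper names `shard_orderings` and `read_in_order`, and the toy shards are assumptions and are not taken from the embodiments.

```python
import random

def shard_orderings(num_shards, n, seed=0):
    """Return the original shard order followed by n random shard orders."""
    rng = random.Random(seed)            # seeded so the orders can be regenerated
    orders = [list(range(num_shards))]   # ordering 0 is the original order
    for _ in range(n):
        perm = list(range(num_shards))
        rng.shuffle(perm)                # shuffle shard ids, not individual samples
        orders.append(perm)
    return orders

def read_in_order(shards, order):
    """Stream samples shard by shard; the order inside a shard is unchanged."""
    for shard_id in order:
        yield from shards[shard_id]

# Toy data (assumed): 9 samples stored as 3 shards.
shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
saved = shard_orderings(len(shards), n=2, seed=42)   # only shard orders are recorded
replayed = [list(read_in_order(shards, o)) for o in saved]
```

Recording `saved` (a few small lists of shard ids) is enough to replay every ordering later, which is the property the embodiment relies on for reproduction.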

FIG. 3 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the invention. As shown in fig. 3, a distributed system for acquiring continuous features according to another exemplary embodiment of the present invention includes: a plurality of node apparatuses 1000, and a server 3000.

The server 3000 is configured to maintain historical statistical results regarding each discrete field requiring serialization processing. For one ordering of the data samples of the designated data set, the server 3000 obtains, for each discrete field requiring serialization processing, the historical statistical result regarding the discrete field based on the field values of the discrete field that have been received from the plurality of node apparatuses 1000; specifically, each time a field value of the discrete field is received from a node apparatus 1000, the server 3000 updates the historical statistical result regarding the discrete field based on the received field value. That is, in the distributed scenario, the server 3000 computes, in a unified manner and based on the data received from the node apparatuses 1000, the global historical statistical result regarding the discrete fields requiring serialization processing. Each node apparatus 1000 can therefore obtain the latest historical statistical result from the server 3000 when processing a data sample to obtain a feature value, which ensures that the historical statistical result used by each node apparatus 1000 in the serialization processing is based on the data already processed by all the node apparatuses 1000.

By way of example, server 3000 may be a parameter server.

As an example, the statistical function for obtaining the historical statistical result may be a statistical function that satisfies the commutative and associative laws. Accordingly, the server 3000 only needs to store the statistical sum and the statistical count obtained based on the historical samples. When the server 3000 receives a field value of a discrete field sent by the node apparatus 1000, it only needs to merge the stored statistical sum with the data received this time in a specified manner (for example, directly, or after the attenuation operation described later) and to accumulate the statistical count in the same specified manner. When the historical statistical result needs to be sent to the node apparatus 1000, the server 3000 may send the result calculated by applying the statistical function (for example, averaging) to the currently stored statistical sum and statistical count.
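To illustrate why storing only a sum and a count suffices for such a statistical function, the following minimal Python sketch may be considered. The class name `SumCount` and its methods are hypothetical; the sketch assumes averaging as the statistical function and direct merging without attenuation.

```python
from dataclasses import dataclass

@dataclass
class SumCount:
    """State kept per field value: a running sum and a running count."""
    total: float = 0.0
    count: float = 0.0

    def merge(self, other):
        # Addition is commutative and associative, so partial results from
        # different node apparatuses can be merged in any order or grouping.
        return SumCount(self.total + other.total, self.count + other.count)

    def mean(self):
        # The statistic is derived from the stored state only when requested.
        return self.total / self.count if self.count else 0.0

# Toy partial results (assumed), e.g. pushed by three different nodes.
a, b, c = SumCount(2, 2), SumCount(0, 1), SumCount(1, 1)
left = a.merge(b).merge(c)
right = a.merge(b.merge(c))
```

Because `left` and `right` are equal, the server does not need to keep the individual historical samples or care about the arrival order of the pushes.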

By way of example, when the historical statistical result regarding a discrete field requiring serialization processing is a historical statistical result regarding the different field values of the discrete field and the tag values corresponding thereto, the node apparatus 1000, after acquiring the feature values of the continuous features corresponding to the field values of the discrete fields in the data samples to be processed in the current round, transmits those data samples to the server 3000, so that the server 3000 updates the historical statistical result regarding the at least one discrete field based on the field value and the tag value of the at least one discrete field in the data samples. It should be understood that if the designated data set also includes continuous fields, the field values of the continuous fields need not be sent to the server 3000.

In one embodiment, when receiving the data samples transmitted by the node apparatus 1000, the server 3000 may delete, from the stored data samples, the earliest-stored data samples that meet a preset condition, and store the data samples received this time, so as to calculate the historical statistical result regarding each discrete field requiring serialization processing based on the updated set of stored data samples. The earliest-stored data samples meeting the preset condition may be, for example, the earliest-stored data samples equal in number to the data samples received this time, or the data samples stored before the most recent preset time period. In the first case, the server 3000 deletes as many of the earliest-stored data samples as it has received this time and then stores the newly received data samples. In the second case, the server 3000 deletes the data samples stored before the most recent preset time period (for example, the data samples stored more than one minute ago) and then stores the newly received data samples. In either case, the historical statistical result regarding each discrete field requiring serialization processing is calculated based on the updated data samples.

According to this exemplary embodiment of the present invention, a window operation is added when the server 3000 updates the historical statistical result, so that the timeliness of the data samples is fully utilized and the obtained continuous features better reflect the influence of timeliness. The statistical operation for obtaining the historical statistical result also becomes more flexible; that is, statistical functions that do not satisfy the commutative and associative laws can be supported to a certain extent. For example, the server 3000 may keep only the 100 most recently received data samples, or only the data samples received within the most recent minute. Since the server 3000 then holds all of the data currently within the window, the historical statistical result may be computed with a statistical function of any form; although this consumes memory, the consumption may be acceptable in certain scenarios.
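A count-based window of this kind might be sketched as follows. The sketch is illustrative only: the class name `WindowedStats`, the window size, and the choice of the median as an example of a statistic that cannot be maintained from a sum and a count alone are assumptions; a time-based window would additionally need timestamps and is not shown.

```python
from collections import deque
from statistics import median

class WindowedStats:
    """Keeps only the most recent samples and computes statistics on demand."""

    def __init__(self, max_samples):
        # A deque with maxlen drops the earliest-stored entry automatically
        # whenever a new entry is appended to a full window.
        self.window = deque(maxlen=max_samples)

    def push(self, field_value, label):
        self.window.append((field_value, label))

    def stat(self, field_value, fn=median, default=0.0):
        # Any function of the raw labels may be used, because the window
        # holds the raw samples rather than a pre-aggregated sum and count.
        labels = [lab for val, lab in self.window if val == field_value]
        return fn(labels) if labels else default

w = WindowedStats(max_samples=3)
for value, label in [("a", 1), ("b", 0), ("a", 1), ("a", 0)]:
    w.push(value, label)
```

After the four pushes the window holds only the last three samples, so the first sample no longer influences any statistic.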

In another embodiment, when receiving the data samples sent by the node apparatus 1000, the server 3000 may perform attenuation processing on the current historical statistical result regarding each discrete field requiring serialization processing, and update the attenuated historical statistical result based on the data samples received this time. According to this exemplary embodiment, the influence of long-past historical samples on the currently processed samples is reduced, so that the timeliness of the data can be fully utilized.

Further, as an example, a distributed system for acquiring continuous features according to another exemplary embodiment of the present invention includes: a plurality of node apparatuses 1000, a data order-preserving apparatus 2000, and a server 3000.

It should be understood that at least one of the node apparatus 1000, the data order-preserving apparatus 2000, and the server 3000 is defined by the processing it performs or the functions it implements, and may indicate either a physical entity or a virtual entity. For example, the node apparatus 1000 may indicate an actual physical computing machine or a logical entity deployed on a computing machine. Likewise, the server 3000 may indicate an actual physical computing machine, or may be deployed as one or more logical entities on the same computing machine as, and/or a different computing machine from, the node apparatus 1000 and/or the data order-preserving apparatus 2000; and the data order-preserving apparatus 2000 may indicate an actual physical computing machine, or may be deployed as one or more logical entities on the same computing machine as, and/or a different computing machine from, the node apparatus 1000 and/or the server 3000. By way of example, the server 3000 may be deployed on a single computing machine; alternatively, the server 3000 may be deployed on multiple computing machines simultaneously. The data order-preserving apparatus 2000 may be deployed on a single computing machine; alternatively, it may be deployed on multiple computing machines simultaneously. By way of example, portions of the server 3000 and/or portions of the data order-preserving apparatus 2000 may be deployed on the same physical computing machine as each node apparatus 1000.

As an example, the data order-preserving apparatus 2000 may run on a physical computing machine different from those of the node apparatuses 1000. Accordingly, the data order-preserving apparatus 2000 only needs to distribute the data samples to the node apparatuses 1000 in order through the network, and when the process of obtaining the continuous features for the designated data set is run again later, it can likewise ensure that the data samples are processed in the same order as this time, so that the result can be reproduced; for example, the three data samples processed in the first round are still data samples 1, 2, and 3. This arrangement simulates an online training scenario, in which data is generated in real time and sent through the network to the module responsible for feature processing.

FIG. 4 shows a block diagram of a distributed system for obtaining continuous features according to another exemplary embodiment of the invention.

As shown in fig. 4, the data order-preserving apparatus 2000 may include a plurality of order-preserving clients 4000 (e.g., 4000-1, 4000-2, …, 4000-n), wherein the plurality of order-preserving clients 4000 correspond to the plurality of node apparatuses 1000 one-to-one, and the node apparatus 1000 is integrated with the corresponding order-preserving client 4000, wherein each order-preserving client 4000 is configured to provide partial data samples in the designated data set to the corresponding node apparatus 1000 in a streaming form according to an order to provide the corresponding node apparatus 1000 with data samples required to be processed by each round, wherein different order-preserving clients 4000 correspond to different data samples.

It should be understood that each of the order-preserving clients 4000 manages a portion of the data samples of a given data set, the data samples managed by different order-preserving clients 4000 do not intersect, and the aggregate of the data samples managed by different order-preserving clients 4000 is exactly the given data set.

As an example, the node apparatuses 1000 and the corresponding order-preserving clients 4000 may be deployed on the same physical computing machine, i.e., the data samples that each node apparatus 1000 is required to process may be stored on the physical computing machine on which it is located.

In one embodiment, when the data samples in the designated data set have a time sequence, the data samples to be managed by each order-preserving client 4000 may be determined according to the time sequence of the data samples, and each order-preserving client 4000 provides the corresponding node apparatus 1000 with the data samples to be processed in each round in time order, with data samples earlier in the time sequence being provided earlier. This ensures that, when the plurality of node apparatuses 1000 process data samples in parallel, data samples earlier in the time sequence are processed earlier. For example, with the data samples in the designated data set sorted in time order, the 1st to 10th data samples, the 31st to 40th data samples, and so on may be assigned to the order-preserving client 4000-1 corresponding to the node apparatus 1000-1 for management; the 11th to 20th data samples, the 41st to 50th data samples, and so on may be assigned to the order-preserving client 4000-2 corresponding to the node apparatus 1000-2 for management; and the 21st to 30th data samples, the 51st to 60th data samples, and so on may be assigned to the order-preserving client 4000-3 corresponding to the node apparatus 1000-3 for management. Thus, in the first round, the order-preserving client 4000-1 provides the 1st to 10th data samples to the node apparatus 1000-1 for processing, the order-preserving client 4000-2 provides the 11th to 20th data samples to the node apparatus 1000-2 for processing, and the order-preserving client 4000-3 provides the 21st to 30th data samples to the node apparatus 1000-3 for processing; in the second round, the order-preserving client 4000-1 provides the 31st to 40th data samples to the node apparatus 1000-1, the order-preserving client 4000-2 provides the 41st to 50th data samples to the node apparatus 1000-2, and the order-preserving client 4000-3 provides the 51st to 60th data samples to the node apparatus 1000-3, and so on. In addition, the order-preserving clients 4000 can ensure that the data samples are processed in this same order when the process is reproduced later.
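The block-wise assignment in the above example (3 order-preserving clients, 10 data samples per client per round) might be sketched as follows. The function name `client_blocks` and its parameters are hypothetical and are used only for illustration.

```python
def client_blocks(num_samples, num_clients, block_size):
    """Assign time-ordered samples to clients in round-robin blocks.

    Returns one list per client; each inner list is one round's block of
    1-based sample positions in the time-ordered data set.
    """
    per_client = [[] for _ in range(num_clients)]
    block_id = 0
    for start in range(1, num_samples + 1, block_size):
        end = min(start + block_size, num_samples + 1)
        block = list(range(start, end))
        # Consecutive blocks go to consecutive clients, so every round
        # covers a contiguous, earlier-first slice of the time sequence.
        per_client[block_id % num_clients].append(block)
        block_id += 1
    return per_client

plan = client_blocks(num_samples=60, num_clients=3, block_size=10)
```

Here `plan[0]` holds the blocks for the first client (samples 1 to 10, then 31 to 40), `plan[1]` those for the second client, and so on; the blocks of different clients do not intersect and together cover the whole data set.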

In another embodiment, when the data samples in the designated data set do not have a time sequence, each order-preserving client 4000 may randomly sort all the data samples it manages N times, and for each of the original sort and the N times random sort of the data samples, provide the corresponding node apparatus 1000 with the data samples of each round that need to be processed according to the sort to obtain the continuous features corresponding to each discrete field for the sort through the plurality of node apparatuses 1000.

As an example, each of the order-preserving clients 4000 may divide all data samples managed by it into a plurality of shards for storage, and each of the order-preserving clients 4000 may randomly order all data samples managed by it N times by randomly ordering the plurality of shards N times.

Fig. 5 and 6 illustrate an example of an operation flow of a distributed system for acquiring continuous features according to an exemplary embodiment of the present invention.

As shown in fig. 5, when the distributed system initially operates, the designated data set is stored by the data order-preserving apparatus 2000, the data samples in the designated data set are sorted in time order, and the server 3000 is in an empty state (fig. 5 shows 0 to make the empty state clear; in practice, because the discrete dimension is large, the server 3000 generally stores only the field values that have actually been processed). The data order-preserving apparatus 2000 starts to provide the data samples to the node apparatuses 1000 in a streaming form. In the first round, the node apparatus 1000-1 receives data sample 1, whose tag value (label) is 1 and whose field value of the discrete field A is a; the node apparatus 1000-2 receives data sample 2, whose tag value is 0 and whose field value of the discrete field A is b; and the node apparatus 1000-3 receives data sample 3, whose tag value is 1 and whose field value of the discrete field A is a.

After receiving its data sample, each node apparatus 1000 first obtains the historical statistical result regarding the discrete field A from the server 3000 (for example, the positive sample rate of each field value of field A). Since the server 3000 is empty at this time, the result returned to each node apparatus 1000 is 0, and accordingly the feature value of the continuous feature A' that each node apparatus 1000 obtains for its data sample based on the returned result is 0.0. After obtaining the feature value, each node apparatus 1000 sends the field value of the discrete field A and the tag value of the data sample it processed to the server 3000 for statistics. The server 3000 thus stores the statistical values of the different field values of the discrete field A in the first three data samples: the number of occurrences (count) of field value a is 2, and the number of positive samples (sum) corresponding to field value a is 2; the number of occurrences of field value b is 1, and the number of positive samples corresponding to field value b is 0, as shown in fig. 6.

After this round is completed, the second round begins. As shown in fig. 6, the data order-preserving apparatus 2000 sends the next data samples in order to the node apparatuses 1000, that is, data samples 4, 5, and 6, and each node apparatus 1000 obtains the current positive sample rate for field A from the server 3000. For example, the positive sample rate of field value a of field A, obtained from the server 3000 by the node apparatus 1000-1 for data sample 4 (whose field value of field A is a) and by the node apparatus 1000-3 for data sample 6 (whose field value of field A is a), is 1 (2 ÷ 2). Accordingly, the feature value of the continuous feature A' that the node apparatus 1000-1 obtains for data sample 4 based on this positive sample rate is 1.0, and the feature value of the continuous feature A' that the node apparatus 1000-3 obtains for data sample 6 is likewise 1.0. The positive sample rate of field value b of field A, obtained from the server 3000 by the node apparatus 1000-2 for data sample 5 (whose field value of field A is b), is 0 (0 ÷ 1); accordingly, the feature value of the continuous feature A' that the node apparatus 1000-2 obtains for data sample 5 is 0.0. The node apparatuses 1000 then transmit the processed data samples 4, 5, and 6 to the server 3000, so that the server 3000 updates the positive sample rates of the field values of field A. By repeating the above process, the field values of the discrete field A in all the data samples are eventually converted into feature values of the corresponding continuous feature, which can then be used in subsequent steps, for example, for training a decision tree model.
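The two rounds described above can be traced with the following Python sketch, which simulates the parallel node apparatuses sequentially in a single process. The names `StatServer` and `run_rounds` are hypothetical. The tag values of data samples 4, 5, and 6 are not stated in this passage; the values 1, 0, and 1 used here are assumptions chosen to be consistent with the attenuation example given later in the description.

```python
class StatServer:
    """Toy stand-in for the server: per field value, (positives, count)."""

    def __init__(self):
        self.stats = {}  # empty at start; only seen field values are stored

    def pull(self, value):
        pos, cnt = self.stats.get(value, (0, 0))
        return pos / cnt if cnt else 0.0   # positive sample rate, 0 if unseen

    def push(self, value, label):
        pos, cnt = self.stats.get(value, (0, 0))
        self.stats[value] = (pos + label, cnt + 1)

def run_rounds(samples, num_nodes):
    """Process samples in order, num_nodes samples per round."""
    server = StatServer()
    features = []
    for start in range(0, len(samples), num_nodes):
        batch = samples[start:start + num_nodes]
        # In each round, every node pulls statistics before any node pushes,
        # so features depend only on samples from earlier rounds.
        features.extend(server.pull(value) for value, _ in batch)
        for value, label in batch:
            server.push(value, label)
    return features, server.stats

# (field value, tag value) for data samples 1 to 6; samples 4 to 6 are assumed.
samples = [("a", 1), ("b", 0), ("a", 1), ("a", 1), ("b", 0), ("a", 1)]
features, stats = run_rounds(samples, num_nodes=3)
```

Under these assumptions, `features` holds 0.0 for the three samples of the first round and 1.0, 0.0, 1.0 for data samples 4, 5, and 6, matching the values described for fig. 5 and fig. 6.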

In fact, when each node apparatus 1000 acquires the historical statistical result regarding the discrete field A from the server 3000 in the second round, the server 3000 stores the historical statistical result regarding the discrete field A obtained based on data samples 1, 2, and 3. Accordingly, each node apparatus 1000 processes data samples 4, 5, and 6 using the historical statistical result obtained based on data samples 1, 2, and 3. Since data samples 1, 2, and 3 are historical samples with respect to data samples 4, 5, and 6, the crossing problem (that is, using information from samples that come later in the ordering) does not arise. Although the historical statistical result regarding the discrete field A based on data samples 1, 2, 3, and 4 should theoretically be used for data sample 5, slight errors are tolerable in large data scenarios and can also prevent overfitting to some extent.

In addition, as an example, when the server 3000 updates the historical statistical result regarding the discrete field A, an attenuation operation may be added. For example, each time data pushed by the node apparatus 1000 is received, the corresponding existing statistical value may be multiplied by an attenuation coefficient (for example, 0.99), and the newly received value may then be added. For example, after the server 3000 updates the historical statistical values for the second time, the historical statistical values on the server 3000 become: the sum for field value a of field A is 2 × 0.99 + 2 = 3.98, and the count for field value a is 2 × 0.99 + 2 = 3.98; the sum for field value b of field A is 0 × 0.99 + 0 = 0, and the count for field value b is 1 × 0.99 + 1 = 1.99. After the attenuation operation, the weight of the first 3 data samples is reduced to 0.99 times its original value, and the more data is subsequently added, the lower the weight of the earlier data samples becomes. Data closer to the current time therefore carries higher importance, which reflects the timeliness of the data.
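The arithmetic of the attenuation example can be checked with the following Python sketch. The function name `decayed_update` is hypothetical, and the sketch assumes that the attenuation coefficient is applied once per round to all stored values before the round's data is added, which is what the figures in the example imply; the tag values 1, 0, and 1 for data samples 4, 5, and 6 are inferred from the sums in the example.

```python
def decayed_update(stats, batch, decay=0.99):
    """Attenuate stored statistics, then add the newly received batch.

    stats: {field_value: [sum, count]}, modified in place
    batch: list of (field_value, label) pairs received this round
    """
    for entry in stats.values():
        entry[0] *= decay   # attenuate the stored sum
        entry[1] *= decay   # attenuate the stored count
    for value, label in batch:
        entry = stats.setdefault(value, [0.0, 0.0])
        entry[0] += label
        entry[1] += 1
    return stats

# State after the first round (data samples 1, 2, 3).
stats = {"a": [2.0, 2.0], "b": [0.0, 1.0]}
# Second round: data samples 4, 5, 6 (tag values inferred, see above).
decayed_update(stats, [("a", 1), ("b", 0), ("a", 1)])
```

After the call, the stored sum and count for field value a are both approximately 3.98, and those for field value b are 0 and approximately 1.99, in agreement with the example.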

Furthermore, when the experiment needs to be repeated (i.e., the process of acquiring the continuous features needs to be repeated), the data-order-preserving apparatus 2000 can ensure that the data samples 1, 2, 3 are still processed in the first round, the data samples 4, 5, 6 are still processed in the second round, and the historical statistics obtained based on the data samples 1, 2, 3 are used for processing in the next processing process, so that the same result as the previous one can be obtained even in a distributed environment, thereby increasing the reliability and interpretability of the experiment.

It should be understood that if the data sample includes a plurality of discrete fields requiring serialization processing, the node apparatus 1000 may obtain, when processing each data sample, the feature value of the continuous feature corresponding to the field value of each discrete field in the data sample. For example, in the above example, each data sample may also include a field value of a discrete field B, so that processing one data sample yields one feature value of the continuous feature A' corresponding to the discrete field A and one feature value of the continuous feature B' corresponding to the discrete field B. Accordingly, the server 3000 may simultaneously maintain the historical statistical results regarding each discrete field requiring serialization processing; for example, the server 3000 may maintain the historical statistical result regarding the discrete field B in addition to that regarding the discrete field A.
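One possible way for a server to maintain statistics for several discrete fields at once is to key the stored state by the pair of field name and field value, as in the following Python sketch. The class name `MultiFieldServer`, the field values x and y, and the naming of features by appending a prime to the field name are assumptions made for illustration.

```python
class MultiFieldServer:
    """Keeps (positives, count) per (field name, field value) pair."""

    def __init__(self):
        self.stats = {}

    def pull(self, sample_fields):
        """Return one continuous feature value per discrete field."""
        feats = {}
        for field, value in sample_fields.items():
            pos, cnt = self.stats.get((field, value), (0, 0))
            feats[field + "'"] = pos / cnt if cnt else 0.0
        return feats

    def push(self, sample_fields, label):
        """Update the statistics of every discrete field of one sample."""
        for field, value in sample_fields.items():
            pos, cnt = self.stats.get((field, value), (0, 0))
            self.stats[(field, value)] = (pos + label, cnt + 1)

server = MultiFieldServer()
server.push({"A": "a", "B": "x"}, label=1)
server.push({"A": "a", "B": "y"}, label=0)
feats = server.pull({"A": "a", "B": "x"})
```

Keying by the pair keeps the statistics of field A and field B independent even if the two fields happen to share a field value.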

Referring to fig. 7, in step S10, a plurality of node devices perform serialization processing in parallel on at least one discrete field in the data samples of a designated data set to obtain a continuous feature corresponding to each discrete field. The plurality of node devices process the data samples according to the ordering of the data samples in the designated data set, wherein data samples earlier in the ordering are processed by the node devices earlier, and different node devices process different data samples.

Specifically, step S10 includes: for each node device, the node device acquires data samples which need to be processed in the current round; the node device acquires a feature value of a continuous feature corresponding to a field value of each discrete field in the data samples needing continuous processing in the current round based on the historical statistical result of the discrete field acquired from the server in the current round, and transmits the field value to the server so as to update the historical statistical result of the discrete field based on the field value.

As an example, the method for acquiring continuous features according to an exemplary embodiment of the present invention may further include: and the data order-preserving device provides the data samples in the specified data set to the plurality of node devices in a streaming mode according to the order, so as to provide the data samples needing to be processed in each round for the plurality of node devices.

As an example, when the data samples in the designated data set have a time sequence, the data samples in the designated data set are sorted in time order, wherein the step of the data order-preserving device providing the data samples in the designated data set to the plurality of node devices in a streaming form according to the ordering may include: the data order-preserving device provides the plurality of node devices, in time order, with the data samples to be processed in each round, wherein data samples earlier in the time sequence are provided to the plurality of node devices for processing earlier.

As an example, when the data samples in the designated data set have no time sequence, the step of the data order-preserving device providing the data samples in the designated data set to the plurality of node devices in a streaming form according to the ordering may include: the data order-preserving device randomly orders the data samples in the designated data set N times and, for each of the original ordering and the N random orderings of the designated data set, provides the plurality of node devices with the data samples to be processed in each round according to that ordering, so that the plurality of node devices obtain, for that ordering, the continuous feature corresponding to each discrete field, wherein N is an integer greater than or equal to 0.

As an example, the method for acquiring continuous features according to an exemplary embodiment of the present invention may further include: the server maintains historical statistics for each discrete field that requires serialization processing.

As an example, the data sample may include a field value and a tag value of the at least one discrete field, wherein the historical statistics for each discrete field requiring serialization processing may be: historical statistics regarding the different field values of the discrete fields and the tag values corresponding thereto.

As an example, each node device obtains a feature value of a continuous feature corresponding to a field value of a discrete field in a data sample whose processing is required for the current round, and transmits the data sample to a server, and the step of the server maintaining a historical statistical result about each discrete field requiring continuous processing may include: and when receiving the data samples sent by the node device, the server deletes the data sample which is stored firstly and meets the preset condition from the stored data samples, stores the data sample received this time, and calculates the historical statistical result of each discrete field needing continuous processing based on the updated data sample.

As an example, each node device obtains a feature value of a continuous feature corresponding to a field value of a discrete field in a data sample whose processing is required for the current round, and transmits the data sample to a server, and the step of the server maintaining a historical statistical result about each discrete field requiring continuous processing may include: and when receiving the data samples sent by the node device, the server performs attenuation processing on the current historical statistical result of each discrete field needing continuous processing, and updates the historical statistical result after the attenuation processing on the basis of the data samples received this time.

As an example, the method for acquiring continuous features according to an exemplary embodiment of the present invention may further include: the data order-preserving device divides the data samples of the designated data set into a plurality of shards for storage, wherein the step of the data order-preserving device randomly ordering the data samples in the designated data set N times may include: the data order-preserving device randomly orders the plurality of shards N times.

As an example, the method for acquiring continuous features according to an exemplary embodiment of the present invention may further include: the data order-preserving device saves the results of the N random orderings of the plurality of shards, so that the process of acquiring the continuous features can be reproduced later.

As an example, the data order-preserving device may include a plurality of order-preserving clients, wherein the plurality of order-preserving clients correspond to the plurality of node devices in a one-to-one manner, and the node devices are integrated with the corresponding order-preserving clients, and the step of providing the data samples in the designated data set to the plurality of node devices in a streaming manner according to the order by the data order-preserving device may include: and for each order-preserving client, the order-preserving client provides partial data samples in the specified data set to the corresponding node devices in a streaming mode according to the sequence so as to provide each round of data samples needing to be processed for the corresponding node devices, wherein different order-preserving clients correspond to different data samples.

It should be understood that the steps involved in the above method may be performed by the node apparatus 1000, the data order preserving apparatus 2000, and the server 3000 in the distributed system described previously, and the operations involved in the above steps have been described in detail with reference to fig. 1 to 6, and details thereof will not be described again here.

It should be understood that the node apparatus, the data order preserving apparatus, and the server or the apparatuses or units constituting them in the distributed system according to the exemplary embodiment of the present invention may be respectively configured as software, hardware, firmware, or any combination of the above to perform a specific function. For example, these components may correspond to application specific integrated circuits, to pure software code, or to modules combining software and hardware. When they are implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments. Further, one or more functions implemented by these components may also be performed collectively by components in a physical device (e.g., a computing machine, etc.).

It should be noted that the distributed system according to the exemplary embodiment of the present invention may completely depend on the execution of the computer program to realize the corresponding functions, that is, the respective components correspond to the respective steps in the functional architecture of the computer program, so that the entire system is called by a special software package (for example, lib library) to realize the corresponding functions.

While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims

1. A distributed system for obtaining continuous features, wherein the distributed system comprises:

a plurality of node devices configured to perform continuous processing, in parallel, on at least one discrete field in the data samples of a designated data set to obtain a continuous feature corresponding to each discrete field;

wherein the plurality of node devices acquire the data samples to be processed in each round according to the ordering of the data samples in the designated data set, data samples earlier in the ordering are acquired by the node devices earlier, and the data samples acquired by different node devices are different,

wherein each node device acquires the feature value of the continuous feature corresponding to the field value of each discrete field requiring continuous processing in the data samples to be processed in the current round, based on the historical statistical result for that discrete field acquired from the server in the current round, and transmits the field value to the server so that the server updates the historical statistical result for the discrete field based on the field value.
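
By way of non-limiting illustration, one round on a single node device may be sketched as follows. The sketch assumes that the historical statistical result is a simple per-value occurrence count and that the server is an in-process object; the names `StatsServer`, `get_stats`, `update`, and `process_round` are hypothetical and do not appear in the original text.

```python
from collections import defaultdict


class StatsServer:
    """Hypothetical in-process stand-in for the server holding historical statistics."""

    def __init__(self):
        # counts[field][value] -> number of times the value has been reported so far
        self.counts = defaultdict(lambda: defaultdict(int))

    def get_stats(self, field):
        # Return a copy so that later updates do not leak into the current round.
        return dict(self.counts[field])

    def update(self, field, value):
        self.counts[field][value] += 1


def process_round(server, samples, discrete_fields):
    """One round on one node: read history, derive features, then report field values."""
    # Acquire the historical statistics once, at the start of the round.
    snapshot = {f: server.get_stats(f) for f in discrete_fields}
    # Assumed statistic: the feature value is the historical count of the field value.
    features = [
        {f: float(snapshot[f].get(sample[f], 0)) for f in discrete_fields}
        for sample in samples
    ]
    # Transmit the field values so the server can update its statistics.
    for sample in samples:
        for f in discrete_fields:
            server.update(f, sample[f])
    return features


server = StatsServer()
round1 = process_round(server, [{"city": "A"}, {"city": "B"}], ["city"])
round2 = process_round(server, [{"city": "A"}, {"city": "A"}], ["city"])
```

Because the feature values of a round are computed only from statistics gathered in earlier rounds, a sample never contributes to its own feature value in this sketch.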

2. The distributed system of claim 1, wherein the distributed system further comprises:

a data order-preserving device configured to provide the data samples in the designated data set to the plurality of node devices in a streaming manner according to the ordering, so as to provide the plurality of node devices with the data samples to be processed in each round.

3. The distributed system of claim 2, wherein, when the data samples in the designated data set have a time sequence, the data samples in the designated data set are ordered by the time sequence,

and the data order-preserving device provides the plurality of node devices with the data samples to be processed in each round according to the time sequence, wherein data samples earlier in the time sequence are provided to the plurality of node devices for processing earlier.

4. The distributed system of claim 2, wherein, when the data samples in the designated data set do not have a time sequence, the data order-preserving device randomly orders the data samples in the designated data set N times and, for each of the original ordering and the N random orderings of the designated data set, provides the plurality of node devices with the data samples to be processed in each round under that ordering, so that the plurality of node devices obtain, for that ordering, a continuous feature corresponding to each discrete field,

wherein N is an integer greater than or equal to 0.

5. The distributed system of claim 1, wherein the distributed system further comprises:

a server configured to maintain historical statistics regarding each discrete field requiring continuous processing.

6. The distributed system of claim 5, wherein each data sample includes a field value of the at least one discrete field and a tag value,

and the historical statistical result for each discrete field requiring continuous processing is a historical statistical result regarding the different field values of the discrete field and the tag values corresponding thereto.

7. The distributed system according to claim 6, wherein each node device acquires the feature value of the continuous feature corresponding to the field value of a discrete field in the data samples to be processed in its current round, and transmits those data samples to the server,

and, upon receiving the data samples sent by a node device, the server deletes from its stored data samples the earliest-stored data samples that meet a preset condition, stores the data samples received this time, and calculates the historical statistical result for each discrete field requiring continuous processing based on the updated stored data samples.
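
By way of non-limiting illustration, the delete-then-store behaviour of the server may be sketched as a sliding window. The sketch assumes that the preset condition is a capacity limit on the number of stored samples and that the statistic is a per-value count and tag sum; the names `WindowedStatsServer`, `max_samples`, and the `"tag"` key are hypothetical.

```python
from collections import defaultdict, deque


class WindowedStatsServer:
    """Hypothetical server that keeps only the most recently stored samples."""

    def __init__(self, max_samples):
        self.max_samples = max_samples  # assumed preset condition: capacity limit
        self.window = deque()

    def receive(self, batch, field):
        for sample in batch:
            # Delete the earliest-stored sample once the window is full.
            if len(self.window) >= self.max_samples:
                self.window.popleft()
            self.window.append(sample)
        return self.stats(field)

    def stats(self, field):
        # Recompute {field value: (count, sum of tag values)} from retained samples.
        acc = defaultdict(lambda: [0, 0.0])
        for sample in self.window:
            entry = acc[sample[field]]
            entry[0] += 1
            entry[1] += sample["tag"]
        return {value: (count, tag_sum) for value, (count, tag_sum) in acc.items()}


server = WindowedStatsServer(max_samples=3)
server.receive([{"city": "A", "tag": 1}, {"city": "B", "tag": 0}], "city")
latest = server.receive([{"city": "A", "tag": 0}, {"city": "A", "tag": 1}], "city")
```

Recomputing from the retained samples keeps the sketch short; an implementation could equally maintain the counts incrementally as samples enter and leave the window.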

8. The distributed system according to claim 6, wherein each node device acquires the feature value of the continuous feature corresponding to the field value of a discrete field in the data samples to be processed in its current round, and transmits those data samples to the server,

and, upon receiving the data samples sent by a node device, the server performs attenuation processing on the current historical statistical result for each discrete field requiring continuous processing, and updates the attenuated historical statistical result based on the data samples received this time.
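
By way of non-limiting illustration, the attenuation-then-update behaviour may be sketched as follows. The sketch assumes that attenuation is multiplication by a constant factor and that the statistic is a per-value count and tag sum; the names `DecayedStatsServer`, `decay`, `mean_tag`, and the `"tag"` key are hypothetical.

```python
from collections import defaultdict


class DecayedStatsServer:
    """Hypothetical server that attenuates old statistics before adding new samples."""

    def __init__(self, decay):
        self.decay = decay  # assumed attenuation factor in (0, 1]
        self.stats = defaultdict(lambda: [0.0, 0.0])  # value -> [count, tag sum]

    def receive(self, batch, field):
        # Step 1: attenuate the current historical statistics.
        for entry in self.stats.values():
            entry[0] *= self.decay
            entry[1] *= self.decay
        # Step 2: update the attenuated statistics with the received samples.
        for sample in batch:
            entry = self.stats[sample[field]]
            entry[0] += 1.0
            entry[1] += float(sample["tag"])

    def mean_tag(self, value, default=0.0):
        # Attenuated mean tag value for a field value; default if the value is unseen.
        count, tag_sum = self.stats.get(value, (0.0, 0.0))
        return tag_sum / count if count > 0 else default


server = DecayedStatsServer(decay=0.5)
server.receive([{"city": "A", "tag": 1}, {"city": "A", "tag": 1}], "city")
server.receive([{"city": "A", "tag": 0}], "city")
```

With a factor below 1, older samples weigh less in the statistics with every batch received, without the server having to store the samples themselves.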

9. The distributed system of claim 4, wherein the data order-preserving device stores the data samples of the designated data set in a plurality of shards,

and the data order-preserving device randomly orders the data samples in the designated data set N times by randomly ordering the plurality of shards N times.
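
By way of non-limiting illustration, producing the original ordering and N random orderings by shuffling shards may be sketched as follows. The sketch assumes that each shard is an in-memory list and that the order of samples inside a shard is left unchanged; the names `shard_orderings`, `n`, and `seed` are hypothetical.

```python
import random


def shard_orderings(shards, n, seed=0):
    """Yield the original ordering, then n orderings with the shard order shuffled."""
    rng = random.Random(seed)  # assumed seed, used only for reproducibility
    # Original ordering: shards in their stored order.
    yield [sample for shard in shards for sample in shard]
    for _ in range(n):
        order = list(range(len(shards)))
        rng.shuffle(order)  # shuffle shard indices, not individual samples
        yield [sample for i in order for sample in shards[i]]


shards = [[1, 2], [3, 4], [5, 6]]
orderings = list(shard_orderings(shards, n=2))
```

Shuffling at shard granularity moves only N permutations of shard indices rather than every sample, at the cost of keeping samples within a shard adjacent.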

10. A method for acquiring continuous features, wherein the method comprises:

a plurality of node devices performing continuous processing, in parallel, on at least one discrete field in the data samples of a designated data set to obtain a continuous feature corresponding to each discrete field, wherein the plurality of node devices process the data samples according to the ordering of the data samples in the designated data set, data samples earlier in the ordering are processed by the node devices earlier, and the data samples processed by different node devices are different,

wherein the step of the plurality of node devices performing continuous processing, in parallel, on at least one discrete field in the data samples of the designated data set comprises:

for each node device, the node device acquires the data samples to be processed in the current round;

the node device acquires the feature value of the continuous feature corresponding to the field value of each discrete field requiring continuous processing in the data samples to be processed in the current round, based on the historical statistical result for that discrete field acquired from the server in the current round, and transmits the field value to the server so that the server updates the historical statistical result for the discrete field based on the field value.