WO2017152766A1

WO2017152766A1 - Sample serialization method and device

Info

Publication number: WO2017152766A1
Application number: PCT/CN2017/074624
Authority: WO
Inventors: 周俊
Original assignee: 阿里巴巴集团控股有限公司; 周俊
Priority date: 2016-03-11
Filing date: 2017-02-23
Publication date: 2017-09-14
Also published as: CN107180017A; TW201734838A; CN107180017B; TWI761331B

Abstract

A sample serialization method and device, relating to the technical field of machine training. The method comprises: obtaining character strings in samples to be serialized (110); determining management servers corresponding to the character strings according to correspondences between the character strings and the management servers (120); sending the character strings to the corresponding management servers, such that the management servers convert the received character strings into corresponding serialized IDs according to mapping tables maintained by the management servers, wherein the character strings in the mapping tables maintained by different management servers are different from each other (130); receiving the serialized IDs corresponding to the character strings returned by the management servers (140); and converting character strings in multiple pieces of sample data into corresponding serialized IDs according to the received serialized IDs corresponding to the character strings (150). The method reduces the time for querying serialized IDs of character strings, and thus can reduce the time for serializing samples and improve serialization efficiency.

Description

Sample serialization method and device

Technical field

The present application relates to the technical field of machine training, and in particular to a sample serialization method and a sample serialization device.

Background

On the Internet, users' online behavior generates large amounts of data. To study users' behavioral habits and related aspects, various models may be built, and these models are generally trained with a machine learning system. In a machine learning system, the strings in each dimension of the sample data may not themselves be serialized IDs; for example, they may not be numeric IDs but names chosen according to business needs. Training directly on the strings of the sample data therefore requires a relatively large amount of computation and consumes considerable resources.

Therefore, to reduce the amount of computation, all strings in the sample data need to be converted into serialized IDs, such as numeric IDs, before training. For example, one piece of sample data has the following format:

There are two columns. The first is the label column, which records whether the user clicked: 1 means the user clicked and 0 means the user did not click. The second is the feature column, which holds all the features of the sample, separated by commas. For example:

1 user_id_123,age_1,sex_1,age_comb_city3

All of "user_id_123,age_1,sex_1,age_comb_city3" must then be converted into numeric IDs; that is, the following mapping relationship must be established:

{set of strings} -> {set of numbers}

The mapping obtained by converting the above "user_id_123,age_1,sex_1,age_comb_city3" is then:

user_id_123 -> number X, age_1 -> number Y, sex_1 -> number Z, age_comb_city3 -> number F.
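The string-to-number mapping above can be sketched on a single machine as follows. This is only an illustration of the baseline that the application sets out to improve on: the function name `build_mapping` and the choice of sequential IDs starting at 0 are assumptions, since the text only names the IDs X, Y, Z and F.

```python
def build_mapping(strings):
    """Assign each distinct string the next unused integer ID (first-seen order)."""
    mapping = {}
    for s in strings:
        if s not in mapping:
            # Assumed ID scheme: 0, 1, 2, ... in order of first appearance.
            mapping[s] = len(mapping)
    return mapping


features = "user_id_123,age_1,sex_1,age_comb_city3".split(",")
mapping = build_mapping(features)
print(mapping)
```

Every machine that serializes samples this way has to hold the whole `mapping` dictionary in memory, which is the limitation described next.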

However, in use the inventor found that when the string set has a very large number of elements, it cannot fit in the memory of a single machine, and serializing the sample data takes a very long time. For example, with 2 billion strings, each machine needs to load the complete mapping table, which requires more than 40G of memory, and serialization also takes a very long time.
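As a rough back-of-the-envelope check of these figures, the arithmetic can be written out as below. Reading "40G" as 40 * 10^9 bytes, the even spread across servers, and the server count of 10 are all assumptions made for illustration; the text gives only the 2 billion strings and the 40G total.

```python
num_strings = 2_000_000_000      # 2 billion strings, from the text
table_bytes = 40 * 10**9         # "40G", read here as 40 * 10^9 bytes (assumption)

# Average memory per string/ID entry implied by the two figures.
bytes_per_entry = table_bytes / num_strings

# Hypothetical: if the table were spread evenly over 10 management servers.
num_servers = 10
per_server_bytes = table_bytes // num_servers

print(bytes_per_entry, per_server_bytes)
```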

Summary of the invention

In view of the above problems, the embodiments of the present application are proposed to provide a sample serialization method and a corresponding sample serialization device that overcome the above problems or at least partially solve them.

To solve the above problems, the present application discloses a sample serialization method, including:

obtaining each string in the samples to be serialized;

determining, according to the correspondence between strings and management servers, the management server corresponding to each string;

sending each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, wherein the strings in the mapping tables maintained by different management servers are different from each other;

receiving the serialized IDs corresponding to the strings returned by the management servers;

converting the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

The present application also discloses a sample serialization method, including:

receiving a string, wherein the string is sent by a serialization server according to the correspondence between strings and management servers, and is obtained by the serialization server from sample data;

converting the received string into a serialized ID according to a locally maintained mapping table, wherein the strings in the mapping tables maintained by different management servers are different from each other;

returning the serialized ID corresponding to the string to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

The present application also discloses a sample serialization device, including:

a string extraction module, configured to obtain each string in the samples to be serialized;

a management server determining module, configured to determine, according to the correspondence between strings and management servers, the management server corresponding to each string;

a string sending module, configured to send each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, wherein the strings in the mapping tables maintained by different management servers are different from each other;

a serialized ID receiving module, configured to receive the serialized IDs corresponding to the strings returned by the management servers;

a sample serialization module, configured to convert the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

a string receiving module, configured to receive a string, wherein the string is sent by a serialization server according to the correspondence between strings and management servers, and is obtained by the serialization server from sample data;

a string conversion module, configured to convert the received string into a serialized ID according to a locally maintained mapping table, wherein the strings in the mapping tables maintained by different management servers are different from each other;

a digitized ID returning module, configured to return the serialized ID corresponding to the string to the corresponding serialization server, so that the serialization server converts the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

Embodiments of the present application include the following advantages:

In the embodiments of the present application, the mapping table required for serialization is distributed across multiple management servers. The strings maintained in the mapping tables of different management servers are different from each other, and so are the numeric IDs of those strings. For the samples to be serialized, the serialization server only needs to send each string to the corresponding management server according to the correspondence between strings and management servers; that management server then obtains the serialized ID of the string, for example a numeric ID, and returns it to the serialization server. The serialization server can thus convert the samples into numeric samples for use in subsequent training. In this way, the serialization server does not need to load the mapping table, which avoids running out of memory on the serialization server. In addition, because the mapping table is distributed across multiple management servers, each management server holds only part of it, so looking up the serialized ID of a string takes less time. This reduces the time needed to serialize the samples and improves serialization efficiency.

Brief description of the drawings

FIG. 1 is a flow chart of the steps of an embodiment of a sample serialization method of the present application, described from the serialization server side;

FIG. 2 is a flow chart of the steps of an embodiment of a sample serialization method of the present application, described from the management server side;

FIG. 3 is a flow chart of the steps of an embodiment of a sample serialization method of the present application;

FIG. 4 is a structural block diagram of an embodiment of a sample serialization device of the present application;

FIG. 5 is a structural block diagram of an embodiment of a sample serialization device of the present application;

FIG. 6 is a structural block diagram of an embodiment of a sample serialization system of the present application.

Detailed description

To make the above objects, features and advantages of the present application more apparent and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.

One of the core ideas of the embodiments of the present application is to distribute the mapping table required for serialization across multiple management servers. The strings maintained in the mapping tables of different management servers are different from each other, and so are the serialized IDs of those strings. For the sample data to be serialized, the serialization server only needs to extract the strings from the sample data and send each string to the corresponding management server according to the correspondence between the string and the management servers; that management server then obtains the serialized ID of the string and returns it to the serialization server. The serialization server can then convert the samples into numeric samples for use in subsequent training. In this way, the serialization server does not need to load the mapping table, which avoids running out of memory on the serialization server. In addition, because the mapping table is distributed across multiple management servers, each management server holds only part of it, so looking up the serialized ID of a string takes less time. This reduces the time needed to serialize the samples and improves serialization efficiency.

Embodiment 1

Referring to FIG. 1, a flow chart of the steps of an embodiment of a sample serialization method of the present application is shown, which may specifically include the following steps:

Step 110: obtain each string in the samples to be serialized;

In the embodiment of the present application, the serialization server first receives the sample data to be serialized. In a preferred embodiment, the method further includes, before step 110:

Step S100: obtain each piece of sample data to be serialized;

The embodiment of the present application may have one or more serialization servers (slaves). Each serialization server may, on notification from the scheduling server (coordinator), obtain the batch of sample data that it is to process.

In the embodiment of the present application, the serialization servers, the management servers and the scheduling server may together form a training cluster for machine training.

In another preferred embodiment of the present application, the step of obtaining each piece of sample data to be serialized includes:

Sub-step S11: obtain the batch of sample data that belongs to the current serialization server after the scheduling server has divided all the sample data evenly.

For example, suppose the training cluster contains two serialization servers, serialization server A and serialization server B, and there are 10000 pieces of sample data in total. The scheduling server can divide the 10000 pieces into two shares of 5000 each and notify serialization server A and serialization server B to each obtain their 5000 pieces of sample data.
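The even split in sub-step S11 can be sketched as follows. The function name `split_evenly` and the use of contiguous slices are assumptions for illustration; the text only requires that the scheduling server divide the samples evenly and tell each serialization server which share to fetch.

```python
def split_evenly(samples, num_servers):
    """Split samples into num_servers contiguous batches whose sizes differ by at most one."""
    base, extra = divmod(len(samples), num_servers)
    batches, start = [], 0
    for i in range(num_servers):
        # The first `extra` servers take one additional sample when the
        # total does not divide exactly.
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches


samples = [f"sample_{i}" for i in range(10000)]
batch_a, batch_b = split_evenly(samples, 2)
print(len(batch_a), len(batch_b))
```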

Of course, sub-step S11 is only one preferred approach of the present application; other allocation methods are also possible, and the embodiments of the present application do not limit this. For example, allocation may be based on the performance of the serialization servers: after receiving the uploaded sample data, and before allocating sample data to the serialization servers, the scheduling server may first obtain the hardware performance of each serialization server and allocate to a server whose hardware performance falls within a given range the corresponding proportion of the sample data.

Further, in the embodiment of the present application, after each serialization server obtains the sample data it is to serialize, it extracts strings from the samples. For example, one sample is as follows:

label	feature
1	user_id_123,age_1,sex_1,age_comb_city3

The sample data has two columns. The first is the label column, which indicates whether the user clicked: a value of 1 means the user clicked and a value of 0 means the user did not click. The second is the feature column, whose value is all the features of the sample, separated by commas.

The serialization server of the present application then extracts "user_id_123", "age_1", "sex_1" and "age_comb_city3" from the feature column.

It can be understood that the above is merely an example of extracted strings; the present application is not limited to it, and sample data in other formats may also be used.

It should be noted that, in the embodiment of the present application, the strings extracted from the sample data are those that are not purely numeric, such as the aforementioned "user_id_123", "age_1", "sex_1" and "age_comb_city3". A feature in the feature column that is a pure number is not extracted.
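The extraction rule described above might look like the sketch below. The tab separator between the two columns, the function name `extract_strings`, and the use of `str.isdigit()` as the test for "purely numeric" are assumptions; the text does not fix any of them.

```python
def extract_strings(sample_line):
    """Return (label, features, strings_to_serialize) for one tab-separated sample line."""
    label, feature_column = sample_line.split("\t", 1)
    features = feature_column.split(",")
    # Purely numeric features are left alone; only the remaining
    # strings need a serialized ID.
    to_serialize = [f for f in features if not f.isdigit()]
    return label, features, to_serialize


label, features, strings = extract_strings("1\tuser_id_123,age_1,sex_1,42")
print(label, strings)
```

Here the hypothetical feature `42` stays in `features` but is not sent for serialization.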

In the embodiment of the present application, the format of the sample data may be analyzed in advance to determine how strings should be extracted from it, for example which extraction template to use. Of course, the scheduling server may determine in advance how strings are to be extracted from the sample data and then notify each serialization server.

Of course, in the embodiment of the present application, the sample data may be serialized one piece at a time: the strings of one piece of sample data are extracted and sent to the corresponding management servers, and the next piece is serialized only after that piece has been serialized. The sample data may also be serialized in batches: the strings of a batch of sample data are sent to the corresponding management servers in one go.

Step 120: determine, according to the correspondence between strings and management servers, the management server corresponding to each string;

The serialization server of the embodiment of the present application may send the extracted strings to the corresponding management server (master). In the embodiment of the present application, each string is maintained in the mapping table of one particular management server. The embodiment may define the correspondence between strings and management servers by some agreed method.

In a preferred embodiment of the present application, the step of determining, according to the correspondence between strings and management servers, the management server corresponding to each string includes:

Sub-step S21: divide the hash value of the string by the number of management servers to obtain the remainder;

Sub-step S22: determine the management server corresponding to the string according to the correspondence between remainders and management servers.

In the embodiment of the present application, taking the aforementioned string "user_id_123" as an example, the hash value hash_value of the string is computed, hash_value is divided by the total number P of management servers, and the remainder is taken; the formula is hash_value % P.

In the embodiment of the present application, the correspondence between each remainder and a management server is set in advance.

For example, if there are 2 management servers, the possible remainders are 0 and 1. Remainder 0 can be assigned to management server A and remainder 1 to management server B. Strings whose hash_value leaves a remainder of 0 when divided by 2 are all sent to management server A, and strings whose hash_value leaves a remainder of 1 are all sent to management server B.

In the embodiment of the present application, to make the correspondence between remainders and management servers direct, the management servers may be named after the remainders, so that once the remainder is computed it directly identifies the management server.
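Sub-steps S21 and S22 can be sketched as below. The text does not say which hash function is used, so `zlib.crc32` is an assumption, chosen because it gives the same value in every process; Python's built-in `hash()` on strings is randomized per process by default, so different serialization servers would disagree on the remainder. The names `server_index` and `names` are hypothetical.

```python
import zlib


def server_index(string, num_servers):
    """Remainder of the string's hash divided by the number of management servers."""
    hash_value = zlib.crc32(string.encode("utf-8"))  # assumed hash; stable across processes
    return hash_value % num_servers                  # hash_value % P


P = 2
# Remainder -> management server, set in advance as in the example above.
names = {0: "management server A", 1: "management server B"}
target = names[server_index("user_id_123", P)]
print(target)
```

Because every serialization server computes the same remainder for the same string, a given string always reaches the same management server.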

In another preferred embodiment of the present application, after the step of obtaining each string in the samples to be serialized, the method further includes:

Step S31: deduplicate the strings.

In the embodiment of the present application, the strings may first be deduplicated to reduce both the computation performed by the management servers and the network usage.

As a result, every string sent to a management server in one transmission is unique. No string is sent twice, so no serialized ID is returned twice, and no extra network bandwidth is consumed. The strings a management server receives in one transmission are likewise unique, so each string is processed only once per computation, which reduces the computation performed by the management server.
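Step S31 can be sketched in one line of Python. Keeping the first-seen order via `dict.fromkeys` is a choice made here for readability; the text only requires that duplicates be removed.

```python
def dedupe(strings):
    """Drop repeated strings while keeping first-seen order."""
    return list(dict.fromkeys(strings))


batch = ["age_1", "sex_1", "age_1", "user_id_123", "sex_1"]
unique = dedupe(batch)
print(unique)
```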

Step 130: send each string to its corresponding management server, so that each management server converts the received strings into corresponding serialized IDs according to the mapping table it maintains, wherein the strings in the mapping tables maintained by different management servers are different from each other;

In the embodiment of the present application, each management server may obtain in advance the strings that it is to maintain and then build its own mapping table. The mapping table is a lookup table between strings and serialized IDs.

In the embodiment of the present application, the serialized ID is a numeric ID, because numbers are the easiest to plug into formulas for calculation during training.

In the embodiment of the present application, for each string, the hash value of the string may be divided by the number of all management servers and the remainder taken; the remainder corresponds to a management server. As described above, with 2 management servers, 0 corresponds to management server A and 1 corresponds to management server B. The string is then sent to the corresponding management server according to the correspondence between the remainder and the management server, and that management server can build its mapping table from the strings it receives.

In practical applications, after obtaining its samples, each serialization server first extracts all the strings of all the samples, computes the hash value of each string, divides each hash value by the total number of management servers and takes the remainder, and then sends each string to the corresponding management server according to the correspondence between remainders and management servers.
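The practical flow just described, in which a serialization server hashes all its strings and groups them by destination, might be sketched as follows. Grouping into one list per management server before sending is an assumption about how the transmission is batched, `zlib.crc32` again stands in for the unspecified hash function, and `bucket_by_server` is a hypothetical name.

```python
import zlib
from collections import defaultdict


def bucket_by_server(strings, num_servers):
    """Group distinct strings by the remainder that identifies their management server."""
    buckets = defaultdict(list)
    for s in dict.fromkeys(strings):  # deduplicate, keep first-seen order
        remainder = zlib.crc32(s.encode("utf-8")) % num_servers
        buckets[remainder].append(s)
    return dict(buckets)


all_strings = ["user_id_123", "age_1", "sex_1", "age_comb_city3", "age_1"]
buckets = bucket_by_server(all_strings, 2)
for remainder, batch in sorted(buckets.items()):
    print(remainder, batch)
```

Each list in `buckets` would then be sent to the management server assigned to that remainder.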

After receiving a string, the management server generates a serialized ID for it and then builds the mapping table from the strings and their corresponding serialized IDs.

For its part, after receiving a string, the management server looks up the serialized ID of that string in its locally maintained mapping table and returns the serialized ID to the serialization server. In practical applications, the management server may return the string together with its corresponding serialized ID to the serialization server.
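A management server's two roles, building its table and answering lookups, can be sketched as an in-memory class. Everything named here is hypothetical. In particular, the ID scheme (local position times P plus the server's index) is an assumption introduced only so that IDs on different servers never collide; this passage does not describe how the IDs are generated. The strings are also placed on the two servers by hand rather than by hash.

```python
class ManagementServer:
    """In-memory stand-in for one management server (hypothetical class)."""

    def __init__(self, index, num_servers):
        self.index = index              # remainder assigned to this server
        self.num_servers = num_servers  # total number of management servers, P
        self.table = {}                 # mapping table: string -> serialized ID

    def build(self, strings):
        """Add previously unseen strings to the local mapping table."""
        for s in strings:
            if s not in self.table:
                # Assumed scheme: position * P + index keeps IDs disjoint across servers.
                self.table[s] = len(self.table) * self.num_servers + self.index

    def lookup(self, strings):
        """Return (string, serialized ID) pairs from the local table."""
        return [(s, self.table[s]) for s in strings]


server_a = ManagementServer(0, 2)
server_b = ManagementServer(1, 2)
server_a.build(["age_1", "sex_1"])                  # placed by hand for the demo
server_b.build(["user_id_123", "age_comb_city3"])
reply = server_a.lookup(["sex_1", "age_1"])
print(reply)
```

Returning pairs rather than bare IDs mirrors the remark above that the string may be returned together with its serialized ID.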

Step 140: receive the serialized IDs corresponding to the strings returned by the management servers;

After sending the strings of the sample data, the serialization server can receive the serialized IDs corresponding to those strings returned by the management servers.

Step 150: convert the strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

After receiving the serialized ID of each string, the serialization server converts the strings in the sample data into the corresponding serialized IDs. For example, suppose the serialized ID of the aforementioned "user_id_123" is 11, that of "age_1" is 13, that of "sex_1" is 24, and that of "age_comb_city3" is 55. The serialized sample data obtained by the conversion is then:

1	11,13,24,55
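Step 150 amounts to a substitution over the feature column, as in the sketch below. The tab separator and the function name `serialize_sample` are assumptions; the IDs 11, 13, 24 and 55 are the ones used in the example above.

```python
def serialize_sample(sample_line, id_of):
    """Replace each feature string in a tab-separated sample line with its serialized ID."""
    label, feature_column = sample_line.split("\t", 1)
    # Features without an entry (for example purely numeric ones that were
    # never extracted) are passed through unchanged.
    ids = [str(id_of.get(f, f)) for f in feature_column.split(",")]
    return label + "\t" + ",".join(ids)


id_of = {"user_id_123": 11, "age_1": 13, "sex_1": 24, "age_comb_city3": 55}
serialized = serialize_sample("1\tuser_id_123,age_1,sex_1,age_comb_city3", id_of)
print(serialized)
```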

The serialized sample data can then be used for subsequent machine training, which speeds up training and improves training efficiency.

In the embodiment of the present application, first, the mapping table required for serialization is distributed across multiple management servers. The strings maintained in the mapping tables of different management servers are different from each other, and so are the numeric IDs of those strings. Because the complete mapping table is distributed across multiple management servers, each management server holds only part of it, so looking up the serialized ID of a string takes less time. This reduces the time needed to serialize the samples and improves serialization efficiency.

Second, for the samples to be serialized, the serialization server only needs to send each string to the corresponding management server according to the correspondence between strings and management servers; that management server then obtains the serialized ID of the string and returns it to the serialization server. The serialization server itself therefore does not store the complete mapping table required for serialization, which avoids running out of memory on the serialization server and improves its performance.

Embodiment 2

Referring to FIG. 2, a flow chart of the steps of an embodiment of a sample serialization method of the present application is shown, which may specifically include the following steps:

Step 210: receive a string, wherein the string is sent by a serialization server according to the correspondence between strings and management servers, and is obtained by the serialization server from sample data;

In the embodiment of the present application, each management server receives strings sent by one or more serialization servers.

In the embodiment of the present application, on the serialization server side, strings are extracted from the sample data to be serialized, the management server for each string is determined according to the correspondence between strings and management servers, and the string is then sent to that management server.

For each serialization server, determining the management server according to the correspondence between strings and management servers and sending the string to that management server includes:

Sub-step S51: divide the hash value of the string by the number of management servers to obtain the remainder;

Sub-step S52: determine the management server corresponding to the string according to the correspondence between remainders and management servers.

In a preferred embodiment of the present application, the mapping table that each management server needs to maintain can be built in real time. In that case, the method further includes, before step 210:

Step S201: Acquire the batch of character strings belonging to the current management server, where the batch of character strings belonging to the current management server differs from the character strings belonging to the other management servers.

In this embodiment of the present application, multiple management servers may be provided. Each management server acquires its own batch of character strings, and the character strings acquired by different management servers do not overlap.

In this embodiment of the present application, each management server may acquire in advance the character strings that it is to maintain, and then build its own mapping table.

In this embodiment of the present application, for each character string, the hash value of the character string may be divided by the total number of management servers to obtain a remainder, and each remainder corresponds to a management server. For example, if there are two management servers, remainder 0 corresponds to management server A and remainder 1 corresponds to management server B. The character string is then sent to the corresponding management server according to the correspondence between remainders and management servers, and that management server can build its mapping table from the character strings it receives.

Here, the remainder of every character string in the batch belonging to the current management server corresponds to the current management server; the remainder is obtained by dividing the hash value of the character string by the number of management servers.

Step S202: Serialize the character strings and build a mapping table between character strings and serialized IDs.

Preferably, the step of serializing the character strings and building the mapping table between character strings and serialized IDs includes:

Sub-step S41: obtain the first total number N1 of character strings in all management servers that precede the current management server in the ordering;

For example, suppose there are management servers A, B, and C, ordered as A, B, C. The first management server A has 110 character strings, the second management server B has 90 character strings, and the third management server C has 100 character strings.

No management server precedes management server A, so its first total number is N1 = 0.

Management server A precedes management server B, so the first total number for B is N1 = 110.

Management servers A and B precede management server C, so the first total number for C is N1 = 200.

Sub-step S42: add the number M of character strings of the current management server to the first total number N1 to obtain the second total number N2;

Sub-step S43: use [N1+1, N2] as the range in which the current management server serializes its character strings.

Management server A has M = 110 character strings, so its serialization range is [1, 110], and the character strings in management server A can be assigned the serialized IDs 1 to 110 in order.

Management server B has 90 character strings, so its serialization range is [111, 200], and the character strings in management server B can be assigned the serialized IDs 111 to 200 in order.

Management server C has 100 character strings, so its serialization range is [201, 300], and the character strings in management server C can be assigned the serialized IDs 201 to 300 in order.
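As an illustrative sketch only, sub-steps S41 to S43 amount to a running total over the ordered management servers. The helper names `serialization_ranges` and `build_mapping` are hypothetical and are not part of the specification.

```python
def serialization_ranges(counts):
    """Given per-server string counts in server order, return [N1+1, N2] per server."""
    ranges = []
    n1 = 0  # S41: total number of strings held by all preceding servers
    for m in counts:
        n2 = n1 + m                  # S42: N2 = N1 + M
        ranges.append((n1 + 1, n2))  # S43: inclusive range [N1+1, N2]
        n1 = n2
    return ranges


def build_mapping(strings, first_id):
    """Assign consecutive serialized IDs, starting at first_id, in the given order."""
    return {s: first_id + offset for offset, s in enumerate(strings)}


# Servers A, B, C holding 110, 90 and 100 strings respectively.
ranges = serialization_ranges([110, 90, 100])
```

With the counts from the example above, `ranges` holds the three intervals (1, 110), (111, 200) and (201, 300), so no serialized ID is issued by two different management servers.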

Step 220: Convert the received character string into a serialized ID according to the locally maintained mapping table, where the character strings in the mapping tables maintained by different management servers are different from each other.

A management server maintains a mapping table that records character strings and their corresponding serialized IDs. Because the character strings it receives are the ones that this management server is responsible for, the management server can convert each received character string into a serialized ID according to its locally maintained mapping table. For example, it looks up the numeric ID corresponding to the character string in the mapping table and returns that numeric ID to the corresponding serialization server.

In another preferred embodiment of the present application, the step of converting the received character string into a serialized ID according to the locally maintained mapping table includes:

Sub-step S61: query whether the character string is present in the locally maintained mapping table;

Sub-step S62: if the character string is present in the locally maintained mapping table, obtain the serialized ID corresponding to the character string;

Sub-step S63: if the character string is not present in the locally maintained mapping table, generate a serialized ID for the character string, and add the character string and its serialized ID to the mapping table.

In this embodiment of the present application, the samples obtained by a serialization server may contain character strings that are not yet recorded in the mapping table of the management server. In that case, the management server may generate a serialized ID for such a character string, record the character string and the serialized ID in the mapping table, and return the serialized ID to the corresponding serialization server.

In practical applications, mutually non-overlapping serialization ranges may be assigned to the management servers in advance. A management server then assigns the new character string a serialized ID from its own serialization range; once that range is used up, another unique serialization range may be allocated to it.
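As an illustrative sketch under stated assumptions, sub-steps S61 to S63 can be modeled as a lookup that falls back to allocating the next unused ID from a pre-assigned range. The class name `ManagementServer`, its method `to_id`, and the choice to raise an error when the range is used up are hypothetical; the specification only states that a further unique range may be allocated in that case and does not say how.

```python
class ManagementServer:
    """Holds one shard of the mapping table plus a reserved range of unused IDs."""

    def __init__(self, mapping, spare_range):
        self.mapping = dict(mapping)
        # Inclusive range of IDs reserved for strings not yet in the table.
        self._next_id, self._last_id = spare_range

    def to_id(self, string):
        # S61 / S62: return the recorded ID if the string is already known.
        if string in self.mapping:
            return self.mapping[string]
        # S63: otherwise take the next ID from the reserved range and record it.
        if self._next_id > self._last_id:
            # Assumption: a real system would request another unique range here.
            raise RuntimeError("serialization range exhausted")
        new_id = self._next_id
        self._next_id += 1
        self.mapping[string] = new_id
        return new_id


server = ManagementServer({"a": 1}, spare_range=(301, 302))
known_id = server.to_id("a")
fresh_id = server.to_id("b")
```

Repeated requests for the same new character string return the same ID, because the string is recorded in the table the first time it is seen.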

Step 230: Return the serialized ID corresponding to the character string to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

Of course, in this embodiment of the present application, after receiving a character string, the management server may record which serialization server sent it. After finding the serialized ID corresponding to the character string, the management server can then use that record to return the character string and its serialized ID to the serialization server that sent it.

In this embodiment of the present application, the mapping table required for serialization is distributed across multiple management servers. The character strings maintained in the mapping tables of different management servers are different from each other, and so are the corresponding serialized IDs. Because the complete mapping table is spread over multiple management servers, each management server searches only its own part when looking up the serialized ID of a character string. This shortens the lookup time, which in turn reduces the time needed to serialize samples and improves serialization efficiency.

Embodiment 3

Referring to FIG. 3, a flow chart of the steps of a preferred embodiment of the sample serialization method of the present application is shown.

To describe the serialization method more clearly, this embodiment is described from the perspective of the overall architecture formed by the scheduling server, the serialization servers, and the management servers.

In this embodiment of the present application, the mapping tables of the management servers may be created with the help of the scheduling server and the serialization servers, as in steps S32 to S38.

Step S32: The scheduling server distributes all sample data evenly and, according to the distribution result, notifies each serialization server to acquire its own batch of sample data.

Before the entire training starts, the scheduling server obtains the identification information of all sample data and can then distribute the sample data evenly, for example by assigning the sample data evenly to N serialization servers according to the serial numbers of the sample data. The scheduling server notifies each serialization server of the distribution result so that each serialization server fetches its own sample data. At the same time, the scheduling server notifies the serialization servers to carry out the string serialization process first and not to serialize the sample data yet, because the management servers do not have mapping tables at this point.
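As one possible illustration of the even distribution in step S32, the sketch below assigns samples to serialization servers by serial number modulo N. The modulo rule and the function name `assign_samples` are assumptions; the specification only states that the samples are distributed evenly according to their serial numbers.

```python
def assign_samples(serial_numbers, num_serialization_servers):
    """Spread sample serial numbers across servers so batch sizes differ by at most one."""
    assignment = {i: [] for i in range(num_serialization_servers)}
    for serial in serial_numbers:
        # Assumed rule: consecutive serial numbers go to consecutive servers.
        assignment[serial % num_serialization_servers].append(serial)
    return assignment


# Seven samples with serial numbers 0..6 spread over three serialization servers.
batches = assign_samples(range(7), 3)
```

The scheduling server would then notify serialization server i of the serial numbers in `batches[i]`, and each serialization server would fetch only those samples.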

Step S34: According to the notification from the scheduling server, each serialization server acquires its own batch of sample data, collects all the character strings in that sample data, and sends them to the management servers.

In practical applications, after each serialization server has obtained its share of the evenly distributed sample data, it may extract all the character strings of that batch according to a pre-configured extraction rule, de-duplicate them, and then send the de-duplicated character strings to the management servers according to a sending rule. The sending rule includes: dividing the hash value of a character string by the total number of management servers to obtain a remainder; and, according to the correspondence between remainders and management servers, sending each character string to the management server corresponding to its remainder.
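The extraction, de-duplication and sending rule of step S34 can be sketched as follows. The whitespace-based extraction rule, the CRC32 hash, and the names `extract_strings` and `batch_by_server` are all assumptions made for illustration; the specification leaves the extraction rule and the hash function open.

```python
import zlib
from collections import defaultdict


def extract_strings(sample: str):
    """Hypothetical extraction rule: each whitespace-separated token is one string."""
    return sample.split()


def batch_by_server(samples, num_servers):
    """De-duplicate the strings of a batch and group them by target management server."""
    unique = set()
    for sample in samples:
        unique.update(extract_strings(sample))
    batches = defaultdict(set)
    for s in unique:
        # Sending rule: remainder of the hash value divided by the number of servers.
        remainder = zlib.crc32(s.encode("utf-8")) % num_servers
        batches[remainder].add(s)
    return dict(batches)


outgoing = batch_by_server(["age=30 city=hz", "city=hz gender=f"], num_servers=2)
```

De-duplicating before sending means that a character string occurring in many samples, such as `city=hz` here, is transmitted only once per serialization server.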

Step S36: The management server receives the character strings sent by the serialization servers.

Step S38: After receiving all the character strings belonging to it, the management server serializes the character strings and builds a mapping table between character strings and serialized IDs.

In this embodiment of the present application, each serialization server may send character strings to a management server over a network connection and, once all of its character strings have been sent, close the network connection to that management server. The management server can then use the closing of the network connection to judge whether that serialization server has finished sending its character strings. When the management server determines that all serialization servers have finished sending, it serializes the character strings and builds the mapping table between character strings and serialized IDs.

Of course, in practical applications, the management server may determine in other ways that it has received all the character strings belonging to it. For example, a completion marker is agreed upon in advance. After a serialization server has sent all of its character strings, it sends the completion marker to each management server, and each management server records the completion marker of that serialization server. Once a management server has received the completion markers of all serialization servers, it determines that it has received all the character strings belonging to it. The embodiments of the present application do not limit the specific method.
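The completion-marker variant described above can be sketched as follows. The class name `MappingTableCollector` and its methods are hypothetical, and network transport is omitted; the sketch only shows how a management server can decide that every serialization server has finished sending.

```python
class MappingTableCollector:
    """Collects strings on one management server until every sender reports completion."""

    def __init__(self, expected_senders):
        self.expected = set(expected_senders)  # IDs of all serialization servers
        self.finished = set()                  # senders whose completion marker arrived
        self.strings = set()                   # strings received so far (de-duplicated)

    def receive_strings(self, sender, strings):
        self.strings.update(strings)

    def receive_completion_marker(self, sender):
        self.finished.add(sender)

    def all_received(self):
        # The mapping table may be built once every expected sender has finished.
        return self.finished >= self.expected


collector = MappingTableCollector(expected_senders=["s1", "s2"])
collector.receive_strings("s1", ["city=hz"])
collector.receive_completion_marker("s1")
ready_after_one = collector.all_received()
collector.receive_strings("s2", ["age=30", "city=hz"])
collector.receive_completion_marker("s2")
ready_after_all = collector.all_received()
```

Only when `all_received()` returns true would the management server proceed to step S38 and assign serialized IDs to the collected character strings.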

After the management servers have built the aforementioned mapping tables, the scheduling server can coordinate the serialization servers to serialize the sample data, as in steps 310 to 332.

Step 310: The scheduling server notifies each serialization server to acquire its own sample data.

Each serialization server performs the following steps:

Step 312: Read the sample data according to the notification.

Step 314: Extract the character strings from the sample data.

Of course, in practical applications, the extracted character strings are also de-duplicated before step 316 is performed.

Step 316: For each character string, divide the hash value of the character string by the number of management servers to obtain a remainder.

Step 318: Determine the management server corresponding to the character string according to the correspondence between remainders and management servers.

Step 320: Send the character string to the corresponding management server.

Each management server performs the following steps:

Step 322: Receive a character string.

That is, the character string sent by a serialization server in step 320 is received.

Step 324: Convert the received character string into a serialized ID according to the locally maintained mapping table.

This mapping table has already been built in steps S32 to S38.

Step 326: Return the serialized ID corresponding to the character string to the corresponding serialization server.

After that, each serialization server performs the following steps:

Step 328: Receive the serialized IDs, corresponding to the character strings, returned by the management servers.

Step 330: Convert the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

Step 332: Output the serialized sample data.

The serialized sample data can then be supplied for machine training.
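The two phases of this embodiment, building the mapping tables (steps S32 to S38) and serializing the samples (steps 310 to 332), can be simulated in a single process as follows. This is a sketch under assumptions: samples are modeled as lists of strings, CRC32 stands in for the unspecified hash function, each server assigns IDs in sorted string order, and remote calls are replaced by dictionary lookups. All function names are hypothetical.

```python
import zlib


def server_index(string, num_servers):
    """Remainder of the (assumed CRC32) hash value divided by the number of servers."""
    return zlib.crc32(string.encode("utf-8")) % num_servers


def build_tables(samples, num_servers):
    """Phase 1 (S32 to S38): shard the unique strings and assign non-overlapping IDs."""
    shards = [set() for _ in range(num_servers)]
    for sample in samples:
        for s in sample:
            shards[server_index(s, num_servers)].add(s)
    tables, next_id = [], 1
    for shard in shards:           # servers are visited in their fixed order
        table = {}
        for s in sorted(shard):    # assumed ordering inside one server
            table[s] = next_id     # IDs continue where the previous server stopped
            next_id += 1
        tables.append(table)
    return tables


def serialize_samples(samples, tables):
    """Phase 2 (steps 310 to 332): replace every string by its serialized ID."""
    n = len(tables)
    unique = {s for sample in samples for s in sample}          # step 314 plus de-duplication
    ids = {s: tables[server_index(s, n)][s] for s in unique}    # steps 316 to 328
    return [[ids[s] for s in sample] for sample in samples]     # steps 330 and 332


samples = [["age=30", "city=hz"], ["city=hz", "gender=f"]]
tables = build_tables(samples, num_servers=3)
serialized = serialize_samples(samples, tables)
```

In `serialized`, both occurrences of `city=hz` carry the same ID, and the three distinct strings receive the IDs 1 to 3 in some order that depends on how the hash shards them.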

In this embodiment of the present application:

First, the mapping table required for serialization is distributed across multiple management servers. The character strings maintained in the mapping tables of different management servers are different from each other, and so are the corresponding serialized IDs. Because the complete mapping table is spread over multiple management servers, each management server searches only its own part when looking up the serialized ID of a character string. This shortens the lookup time, which in turn reduces the time needed to serialize samples and improves serialization efficiency.

Second, for the samples to be serialized, a serialization server only needs to send each character string to the corresponding management server according to the correspondence between character strings and management servers; that management server then obtains the serialized ID of the character string and returns it to the serialization server. The serialization server therefore does not store the complete mapping table required for serialization, which avoids running out of memory on the serialization server and improves its performance.

Third, in combination with steps S32 to S38, the construction of the mapping table has several advantages. The character strings of all samples are extracted by multiple serialization servers in parallel, so extraction is fast and the mapping table is built sooner. The construction of the mapping table is also distributed across multiple management servers: no management server has to build the complete mapping table, only its own part, which further speeds up construction. Finally, the mapping table is now built on the management servers, so the serialization servers that traditionally performed serialization neither build nor store the mapping table, which reduces their load.

It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments of the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.

Embodiment 4

Referring to FIG. 4, a structural block diagram of an embodiment of a sample serialization device of the present application is shown, which may specifically include the following modules:

a string extraction module 410, configured to acquire the character strings in the samples to be serialized;

where the device further includes, before the string extraction module 410:

a sample data acquisition module S400, configured to acquire the sample data to be serialized;

a management server determining module 420, configured to determine the management server corresponding to each character string according to the correspondence between character strings and the management servers;

a string sending module 430, configured to send each character string to the corresponding management server, so that each management server converts the received character strings into the corresponding serialized IDs according to the mapping table it maintains, where the character strings in the mapping tables maintained by different management servers are different from each other;

a serialized ID receiving module 440, configured to receive the serialized IDs, corresponding to the character strings, returned by the management servers;

a sample serialization module 450, configured to convert the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

In another preferred embodiment of the present application, the management server determining module 420 includes:

a string remainder module, configured to divide the hash value of a character string by the number of management servers to obtain a remainder;

a first remainder determining module, configured to determine the management server corresponding to the character string according to the correspondence between remainders and management servers.

In another preferred embodiment of the present application, the device further includes, after the string extraction module 410:

a de-duplication module, configured to de-duplicate the character strings.

In another preferred embodiment of the present application, the device includes, before the string extraction module 410:

a first sample data acquisition module, configured to acquire the batch of sample data that belongs to the current serialization server after the scheduling server has distributed all sample data evenly.

This embodiment can be applied on the serialization server side.

Embodiment 5

Referring to FIG. 5, a structural block diagram of another embodiment of a sample serialization device of the present application is shown, which may specifically include the following modules:

a string receiving module 510, configured to receive a character string, where the character string is sent by a serialization server according to the correspondence between character strings and the management servers, and is obtained by the serialization server from sample data;

a string conversion module 520, configured to convert the received character string into a serialized ID according to the locally maintained mapping table, where the character strings in the mapping tables maintained by different management servers are different from each other;

a digitized ID returning module 530, configured to return the serialized ID corresponding to the character string to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs.

This embodiment can be applied on the management server side.

In a preferred embodiment of the present application, the device includes, before the string receiving module 510:

a string acquisition module, configured to acquire the batch of character strings belonging to the current management server, where the batch of character strings belonging to the current management server differs from the character strings belonging to the other management servers;

a mapping table construction module, configured to serialize the character strings and build a mapping table between character strings and serialized IDs.

In another preferred embodiment of the present application, the mapping table construction module includes:

a first quantity acquisition module, configured to obtain the first total number N1 of character strings in all management servers that precede the current management server in the ordering;

a second quantity acquisition module, configured to add the number M of character strings of the current management server to the first total number N1 to obtain the second total number N2;

a serialization range determining module, configured to use [N1+1, N2] as the range in which the current management server serializes its character strings.

In another preferred embodiment of the present application, the string conversion module includes:

a query module, configured to query whether the character string is present in the locally maintained mapping table;

a first digitized ID acquisition module, configured to obtain the serialized ID corresponding to the character string if the character string is present in the locally maintained mapping table;

a generating module, configured to, if the character string is not present in the locally maintained mapping table, generate a serialized ID for the character string and add the character string and its serialized ID to the mapping table.

In another preferred embodiment of the present application, the batch of character strings belonging to the current management server is such that:

the remainder of each character string in the batch corresponds to the current management server, where the remainder is obtained by dividing the hash value of the character string by the number of management servers.

Embodiment 6

Referring to FIG. 6, a structural block diagram of an embodiment of a sample serialization system of the present application is shown, which may specifically include the following:

a scheduling server 600, multiple serialization servers 700, and multiple management servers 800. The figure shows only three serialization servers 700 and three management servers 800; the number of each kind of server can be set according to actual needs.

The scheduling server 600 includes:

a notification module 601, configured to notify each serialization server to acquire its own sample data.

In a preferred embodiment of the present application, in practical applications, the scheduling server 600 further includes an even distribution module, configured to distribute all sample data evenly and, according to the distribution result, notify each serialization server to acquire its own batch of sample data.

Before the entire training starts, the notification module of the scheduling server 600 is further configured to notify the serialization servers to carry out the string serialization process first and not to serialize the sample data yet, because the management servers do not have mapping tables at this point.

Each serialization server 700 includes:

a sample acquisition module 701, configured to read the sample data according to the notification;

a string extraction module 702, configured to extract the character strings from the sample data;

Of course, in practical applications, the string extraction module 702 is further configured to de-duplicate the extracted character strings before they are passed on to the next module.

a string remainder module 703, configured to, for each character string, divide the hash value of the character string by the number of management servers to obtain a remainder;

a first remainder determining module 704, configured to determine the management server corresponding to the character string according to the correspondence between the remainder and the management servers;

a string sending module 705, configured to send the character string to the corresponding management server;

a serialized ID receiving module 706, configured to receive the serialized IDs, corresponding to the character strings, returned by the management servers;

a sample serialization module 707, configured to convert the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs;

an output module 708, configured to output the serialized sample data.

In another embodiment of the present application, to support the creation of the mapping tables by the management servers, the serialization server 700 includes:

an integrated sending module, configured to, according to the notification from the scheduling server, acquire the batch of sample data belonging to the serialization server, collect all the character strings in that sample data, and send them to the management servers.

每个管理服务器800包括:Each management server 800 includes:

字符串接收模块801，用于接收字符串；a string receiving module 801, configured to receive a character string;

接收字符串发送模块705发送的字符串。 The character string sent by the string sending module 705 is received.

字符串转换模块802，用于根据本地维护的映射表，将所接收到的字符串转换为序列化ID；a string conversion module 802, configured to convert the received string into a serialized ID according to a locally maintained mapping table;

数字化ID返回模块803，用于将所述字符串对应的序列化ID返回给相应的序列化服务器，The digitized ID returning module 803 is configured to return the serialized ID corresponding to the character string to the corresponding serialization server.

在本申请另一实施例中，管理服务器800还通过以下模块创建映射表：In another embodiment of the present application, the management server 800 also creates a mapping table through the following modules:

该字符串获取模块获取的字符串可以由序列化服务器的整合发送模块中获得字符串。The string obtained by the string acquisition module can be obtained by the serialization server's integrated sending module.

Secondly, the serialization server only needs to send each character string in the samples to be serialized to the management server determined by the correspondence between character strings and management servers; that management server then obtains the serialized ID of the character string and returns it to the serialization server. The serialization server therefore does not itself store the complete mapping table required for serialization, which avoids exhausting its memory and improves its performance.
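The correspondence between strings and management servers can be sketched with the remainder rule given in the claims: divide the string's hash value by the number of management servers and use the remainder to pick the server. The application does not name a hash function; `zlib.crc32` is used here only as an assumed stand-in because it is stable across processes, unlike Python's built-in `hash` for strings. The function names are likewise hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, Set


def server_index(string: str, num_servers: int) -> int:
    """Remainder of the string's hash value divided by the number of servers."""
    return zlib.crc32(string.encode("utf-8")) % num_servers


def group_by_server(strings: Iterable[str], num_servers: int) -> Dict[int, Set[str]]:
    """Deduplicate the strings and group them by their target management server."""
    groups: Dict[int, Set[str]] = defaultdict(set)
    for string in strings:
        groups[server_index(string, num_servers)].add(string)
    return dict(groups)


# Hypothetical strings extracted from a batch of samples; "item_x" appears twice.
strings = ["user_a", "item_x", "user_b", "item_x"]
groups = group_by_server(strings, num_servers=3)
```

Grouping into sets also removes duplicates, so each distinct string is sent to its management server only once per batch.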

Because the device embodiments are substantially similar to the method embodiments, they are described only briefly; for the relevant details, refer to the corresponding parts of the description of the method embodiments.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a device, or a computer program product. Accordingly, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

In a typical configuration, the computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The embodiments of the present application are described with reference to flowcharts and/or block diagrams of the method, the terminal device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise" and "include", and any other variants thereof, are intended to cover a non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.

The sample serialization method and the sample serialization device provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, a person of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A sample serialization method, comprising:

acquiring each character string in the samples to be serialized;

determining, according to the correspondence between character strings and management servers, the management server corresponding to each character string;

sending each character string to its corresponding management server, so that each management server converts the received character strings into corresponding serialized IDs according to the mapping table it maintains, wherein the character strings in the mapping tables maintained by different management servers are different from one another;

receiving the serialized IDs, corresponding to the character strings, returned by the management servers;

converting the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs corresponding to the character strings.
2. The method according to claim 1, wherein the step of determining, according to the correspondence between character strings and management servers, the management server corresponding to each character string comprises:

dividing the hash value corresponding to the character string by the number of management servers to obtain a remainder;

determining the management server corresponding to the character string according to the correspondence between remainders and management servers.
3. The method according to claim 1, wherein after the step of acquiring each character string in the samples to be serialized, the method further comprises:

deduplicating the character strings.
4. The method according to any one of claims 1 to 3, wherein before the step of acquiring each character string in the samples to be serialized, the method further comprises:

acquiring the batch of sample data that belongs to the current serialization server after the scheduling server has evenly distributed all the sample data.
5. A sample serialization method, comprising:

receiving character strings, wherein the character strings are sent by a serialization server according to the correspondence between character strings and management servers, and are acquired by the serialization server from sample data;

converting the received character strings into serialized IDs according to a locally maintained mapping table, wherein the character strings in the mapping tables maintained by different management servers are different from one another;

returning the serialized IDs corresponding to the character strings to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs corresponding to the character strings.
6. The method according to claim 5, wherein before the step of receiving character strings, the method further comprises:

acquiring the batch of character strings that belongs to the current management server, wherein the batch of character strings belonging to the current management server is different from the character strings belonging to other management servers;

serializing the character strings and constructing a mapping table between character strings and serialized IDs.
7. The method according to claim 6, wherein the step of serializing the character strings and constructing a mapping table between character strings and serialized IDs comprises:

obtaining a first total number N1 of character strings in the management servers that precede the current management server in the ordering;

adding the number M of character strings of the current management server to the first total number N1 to obtain a second total number N2;

using [N1+1, N2] as the range in which the current management server serializes its character strings.
8. The method according to any one of claims 5 to 7, wherein the step of converting the received character strings into serialized IDs according to a locally maintained mapping table comprises:

querying whether the character string exists in the locally maintained mapping table;

if the character string exists in the locally maintained mapping table, obtaining the serialized ID corresponding to the character string;

if the character string does not exist in the locally maintained mapping table, generating a serialized ID for the character string, and adding the character string and its serialized ID to the mapping table.
9. The method according to claim 6 or 7, wherein, for the batch of character strings belonging to the current management server:

the remainder corresponding to each character string in the batch belongs to the current management server, the remainder being obtained by dividing the hash value corresponding to the character string by the number of management servers.
10. A sample serialization device, comprising:

a string extraction module, configured to acquire each character string in the samples to be serialized;

a management server determining module, configured to determine, according to the correspondence between character strings and management servers, the management server corresponding to each character string;

a string sending module, configured to send each character string to its corresponding management server, so that each management server converts the received character strings into corresponding serialized IDs according to the mapping table it maintains, wherein the character strings in the mapping tables maintained by different management servers are different from one another;

a serialized ID receiving module, configured to receive the serialized IDs, corresponding to the character strings, returned by the management servers;

a sample serialization module, configured to convert the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs corresponding to the character strings.
11. The device according to claim 10, wherein the management server determining module comprises:

a string remainder module, configured to divide the hash value corresponding to the character string by the number of management servers to obtain a remainder;

a first remainder determining module, configured to determine the management server corresponding to the character string according to the correspondence between remainders and management servers.
12. The device according to claim 10, further comprising, after the string extraction module:

a deduplication module, configured to deduplicate the character strings.
13. The device according to any one of claims 10 to 12, further comprising, before the string extraction module:

a first sample data acquisition module, configured to acquire the batch of sample data that belongs to the current serialization server after the scheduling server has evenly distributed all the sample data.
14. A sample serialization device, comprising:

a string receiving module, configured to receive character strings, wherein the character strings are sent by a serialization server according to the correspondence between character strings and management servers, and are acquired by the serialization server from sample data;

a string conversion module, configured to convert the received character strings into serialized IDs according to a locally maintained mapping table, wherein the character strings in the mapping tables maintained by different management servers are different from one another;

a digitized ID return module, configured to return the serialized IDs corresponding to the character strings to the corresponding serialization server, so that the serialization server converts the character strings in each piece of sample data into the corresponding serialized IDs according to the received serialized IDs corresponding to the character strings.
15. The device according to claim 14, further comprising, before the string receiving module:

a string acquisition module, configured to acquire the batch of character strings that belongs to the current management server, wherein the batch of character strings belonging to the current management server is different from the character strings belonging to other management servers;

a mapping table construction module, configured to serialize the character strings and construct a mapping table between character strings and serialized IDs.
16. The device according to claim 15, wherein the mapping table construction module comprises:

a first quantity acquisition module, configured to obtain a first total number N1 of character strings in the management servers that precede the current management server in the ordering;

a second quantity acquisition module, configured to add the number M of character strings of the current management server to the first total number N1 to obtain a second total number N2;

a serialization range determining module, configured to use [N1+1, N2] as the range in which the current management server serializes its character strings.
17. The device according to any one of claims 14 to 16, wherein the string conversion module comprises:

a query module, configured to query whether the character string exists in the locally maintained mapping table;

a first digitized ID acquisition module, configured to obtain the serialized ID corresponding to the character string if the character string exists in the locally maintained mapping table;

a generating module, configured to generate a serialized ID for the character string if the character string does not exist in the locally maintained mapping table, and to add the character string and its serialized ID to the mapping table.
18. The device according to claim 15 or 16, wherein, for the batch of character strings belonging to the current management server:

the remainder corresponding to each character string in the batch belongs to the current management server, the remainder being obtained by dividing the hash value corresponding to the character string by the number of management servers.