CN117113929A - Method and device for splitting field data, electronic equipment and storage medium - Google Patents

Method and device for splitting field data, electronic equipment and storage medium

Info

Publication number
CN117113929A
Authority
CN
China
Prior art keywords
data
target
splitting
field
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311162305.8A
Other languages
Chinese (zh)
Other versions
CN117113929B (en)
Inventor
赵提
杨悦
杜啸争
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jinxin Digital Technology Group Co ltd
Original Assignee
Zhongdian Jinxin Digital Technology Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jinxin Digital Technology Group Co ltd
Priority to CN202311162305.8A
Priority claimed from CN202311162305.8A
Publication of CN117113929A
Application granted
Publication of CN117113929B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a field data splitting method, a field data splitting device, electronic equipment and a storage medium, wherein the field data splitting method comprises the following steps: selecting a target division mode from a plurality of candidate division modes according to the data characteristics of the historical service data, and setting the value of the division parameter corresponding to the target division mode; dividing the historical service data according to the target division mode and the set division parameter value to obtain a training data set, a test data set and a verification data set; training and optimizing the field data splitting model by using the training data set, the testing data set and the verification data set to obtain a field data splitting optimization model; and splitting field data of the field to be split in the file to be split by using a field data splitting optimization model. By adopting the field data splitting method, the device, the electronic equipment and the storage medium, the problem of low splitting accuracy when the data type of split data is complex and the data quantity is large in the existing field data splitting method is solved.

Description

Method and device for splitting field data, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a field data splitting method, a field data splitting device, an electronic device, and a storage medium.
Background
With the advent of the digital era, data has become an important resource for enterprise development, and enterprises often need to split business data during day-to-day processing. Because the separators between data items are not always distinct, for example when columns are separated by varying numbers of spaces, simple character strings are usually split with regular expressions.
However, when the data types are complex and the data volume is large, splitting with regular expressions yields low splitting accuracy.
Disclosure of Invention
In view of the above, the present application aims to provide a field data splitting method, a field data splitting device, an electronic device and a storage medium, so as to solve the problem of low splitting accuracy when the data type of split data is complex and the data size is large.
In a first aspect, an embodiment of the present application provides a field data splitting method, including:
selecting a target division mode from a plurality of candidate division modes according to the data characteristics of the historical service data, and setting the value of a division parameter corresponding to the target division mode, wherein the data characteristics of the historical service data are the same as the data characteristics of the file to be split;
dividing the historical service data according to the target division mode and the set division parameter value to obtain a training data set, a test data set and a verification data set;
training and optimizing the field data splitting model by using the training data set, the testing data set and the verification data set to obtain a field data splitting optimization model;
splitting the field to be split in the file to be split by using a field data splitting optimization model.
Optionally, the plurality of candidate division modes include a random division mode, a custom division mode and a time sequence division mode, and selecting a target division mode from the plurality of candidate division modes according to the data characteristics of the historical service data includes: if the historical service data has the time sensitivity feature, selecting the time sequence division mode as the target division mode, wherein the time sensitivity feature means that the historical service data includes a service field indicating time; if the historical service data has the format-fixed feature, selecting the custom division mode as the target division mode, wherein the format-fixed feature means that the historical service data is configured according to a fixed format; and if the historical service data has neither the time sensitivity feature nor the format-fixed feature, selecting the random division mode as the target division mode.
Optionally, the method further comprises: if the historical service data has the characteristics of unbalanced classification or relevance, a random division mode is not selected as a target division mode; the characteristic of unbalanced classification is that the ratio of the data quantity between two types of data in the historical service data is larger than a ratio threshold value, and the characteristic of relevance is that the data of different rows in the historical service data have a relevance relation.
Optionally, the division parameters include a random proportion corresponding to the random division mode and a row identifier corresponding to the custom division mode, and dividing the historical service data according to the target division mode and the set division parameter values to obtain a training data set, a test data set and a verification data set includes: if the target division mode is the random division mode, randomly selecting a number of rows corresponding to the random proportion from the historical service data as target rows, and taking the service data corresponding to the target rows and the target service field as the training data set; if the target division mode is the custom division mode, selecting the rows corresponding to the row identifier from the historical service data as target rows, and acquiring the training data set, the test data set and the verification data set from the service data corresponding to the target rows and the target service field; and if the target division mode is the time sequence division mode, selecting service data in different preset ranges from the historical service data in chronological order as the training data set, the verification data set and the test data set respectively.
Optionally, the division parameters further include a time column identifier corresponding to the time sequence division mode, and before training and optimizing the field data splitting model by using the training data set, the test data set and the verification data set, the method further includes: if the target division mode is the time sequence division mode, determining whether the service data corresponding to the time column identifier contains data that does not meet the format requirement; and if such data exists, converting it to obtain data that meets the format requirement.
Optionally, the method further comprises: determining the number of distinct values in the service data corresponding to the time column identifier; if the number of distinct values is less than a set number threshold, the time sequence division mode cannot be used as the target division mode.
Optionally, the row identifier includes a training data row identifier, a test data row identifier, and a verification data row identifier, and acquiring a training data set, a test data set, and a verification data set from service data corresponding to the target row and the target service field includes: selecting a row corresponding to the training data row identification from the historical service data as a target training row, and taking the service data corresponding to the target training row and the target service field as a training data set; selecting a row corresponding to the test data row identification from the historical service data as a target test row, and taking the service data corresponding to the target test row and the target service field as a test data set; selecting a row corresponding to the verification data row identification from the historical service data as a target verification row, and taking the service data corresponding to the target verification row and the target service field as a verification data set.
In a second aspect, an embodiment of the present application further provides a field data splitting apparatus, where the apparatus includes:
a mode selection module, configured to select a target division mode from a plurality of candidate division modes according to the data characteristics of the historical service data, and to set the value of the division parameter corresponding to the target division mode, wherein the data characteristics of the historical service data are the same as the data characteristics of the file to be split;
the data dividing module is used for dividing the historical service data according to the target dividing mode and the set dividing parameter value to obtain a training data set, a test data set and a verification data set;
the model training module is used for training and optimizing the field data splitting model by utilizing the training data set, the test data set and the verification data set to obtain a field data splitting optimization model;
and a data splitting module, configured to split the field to be split in the file to be split by using the field data splitting optimization model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the field data splitting method as described above.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a field data splitting method as described above.
The embodiment of the application has the following beneficial effects:
According to the field data splitting method, device, electronic equipment and storage medium provided by the embodiments of the present application, a target division mode can be determined according to the data characteristics, so that the target division mode accurately reflects the data characteristics of the historical service data and improves the accuracy of data set division for historical service data with different data characteristics. Meanwhile, because the data characteristics of the selected historical service data are the same as those of the field to be split, the data sets obtained with the target division mode also reflect the data characteristics of the data in the field to be split more accurately. Consequently, when the field data splitting model is trained with data sets that reflect those characteristics, the resulting field data splitting optimization model splits field data more accurately and produces a more accurate splitting result. In addition, the target division mode may be a combination of several division modes, so that the historical service data can be divided along multiple dimensions, which further improves the accuracy of data set division. Compared with field data splitting methods in the prior art, this solves the problem of low splitting accuracy when the data types of the split data are complex and the data volume is large.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a field data splitting method provided by an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a field data splitting apparatus according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment obtained by a person skilled in the art without making any inventive effort falls within the scope of protection of the present application.
It should be noted that, before the present application was proposed, data had already become an important resource for enterprise development with the advent of the digital era, and enterprises often needed to split business data during day-to-day processing. Because the separators between data items are not always distinct, for example when columns are separated by varying numbers of spaces, simple character strings are usually split with regular expressions. However, when the data types are complex and the data volume is large, splitting with regular expressions yields low splitting accuracy.
Based on the above, the embodiment of the application provides a field data splitting method to improve the accuracy of data splitting.
Referring to fig. 1, fig. 1 is a flowchart of a field data splitting method according to an embodiment of the application. As shown in fig. 1, a field data splitting method provided by an embodiment of the present application includes:
Step S101, selecting a target division mode from a plurality of candidate division modes according to the data characteristics of the historical service data, and setting the value of the division parameter corresponding to the target division mode.
In this step, the historical service data may refer to service data having the same data characteristics as the file to be split; for example, the historical service data may be financial service data.
Data characteristics may refer to properties of the data that are used to determine the target division mode. Data characteristics include, but are not limited to: the time sensitivity feature, the format-fixed feature, the classification imbalance feature and the relevance feature. The data characteristics of the historical service data are the same as those of the file to be split, the purpose being to improve the accuracy of the trained field data splitting model.
The candidate division modes may refer to preset data set division modes.
The dividing parameters may refer to parameters used when dividing the data set, and the dividing parameters corresponding to different dividing modes are different.
In the embodiment of the application, the data in the files to be split by different enterprises or institutions are different, the data in the files to be split have different data characteristics, in order to be capable of splitting the fields to be split in the files to be split more accurately, the data characteristics of the files to be split can be determined first, then the historical service data with the same data characteristics as the files to be split are selected from various candidate historical service data to be used as target historical service data, and the target division mode is selected from various candidate division modes according to the data characteristics of the selected target historical service data.
When the target division method is selected, only one division method may be selected from the plurality of candidate division methods as the target division method, or two or more division methods may be selected from the plurality of candidate division methods as the target division method, so that the data is divided by using the selected one or more target division methods. The purpose of the processing is to adapt to the characteristics of complex data types and large data volume in the field to be split in the file to be split, and the data in the file to be split is divided in multiple layers in a layering division mode, so that the inaccuracy problem caused by processing multiple types and large data volume data at one time is avoided.
When the method is specifically executed, the execution sequence of the target division modes can be determined according to the selection sequence of the multiple target division modes, and each target division mode is executed in turn according to the determined execution sequence of the target division modes. In addition, when determining the execution sequence of the target division mode, the value of the division parameter corresponding to the target division mode can be set, and the historical service data is divided by using the set value of the division parameter.
Step S102, dividing the historical service data according to the target division mode and the set division parameter values to obtain a training data set, a test data set and a verification data set.
In the step, the historical service data is divided according to the determined target division modes, the execution sequence among different target division modes and the value of the division parameter corresponding to each target division mode.
In the embodiment of the present application, the candidate division modes include a time sequence division mode, a random division mode, and a custom division mode, and the target division mode also includes at least one of the three division modes.
Assuming that the selected target division mode is a time sequence division mode, the historical service data is divided only according to the time sequence division mode and the value of the division parameter corresponding to the time sequence division mode.
Assuming that the selected target division modes are the time sequence division mode and the random division mode, the historical service data is first divided according to the time sequence division mode and its division parameter values, and then divided a second time according to the random division mode and its division parameter values. After division, the training data set, the test data set and the verification data set corresponding to the historical service data are obtained.
Here, multiple pieces of historical service data may be obtained, and a target division mode and the corresponding division parameter values are determined for each piece. The target division mode and the division parameter values are used as training labels, the data set division results are used as a test set, and the historical service data is used as a training set. A data set division model is then trained with the training set and the test set to obtain a data set division optimization model, which can be used to divide the data to be divided and thereby improve the efficiency of data set division. The data set division model may be, for example, an AutoModel model.
Step S103, training and optimizing the field data splitting model by using the training data set, the test data set and the verification data set to obtain a field data splitting optimization model.
In this step, the field data splitting model may refer to a neural network model, and the field data splitting model is used to split the field to be split into a plurality of result fields according to the field names of the fields. The result field may refer to a field obtained after splitting.
In the embodiment of the application, the training data set is first labeled. Take a field to be split named "single-day income amount/category" as an example, where one row of data in the field reads: "2023-08-16 12000 garment". A plurality of labels are set for the data in the field to be split, each label corresponding to one result field; here the labels are: date, amount, category. The field data splitting model is then trained with the labeled training data set, the training result is verified with the verification data set, and training is iterated according to the verification result to obtain the field data splitting optimization model.
After the field data splitting optimization model has determined the result fields corresponding to the field to be split, each row of data in the field to be split is split and recombined using the splitting rule and the combining rule of each result field, yielding the content of each result field. For example, for the date, the model separates the year, month and day using "-" as the separator and then recombines them according to a preset combining rule, so that the content of the first result field (date) is 2023.08.16. The content of the other result fields is obtained in the same way and is not described again here.
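The splitting and combining rules in this example can be illustrated with a hand-written stand-in. This is only a sketch of the rule applied to one row, not the trained model itself; the function name `split_row`, the `LABELS` tuple and the choice of "." as the combining character are assumptions made to reproduce the 2023.08.16 result quoted above.

```python
# Result-field labels from the example in the text.
LABELS = ("date", "amount", "category")


def split_row(row: str) -> dict:
    """Split one row such as '2023-08-16 12000 garment' into result fields."""
    # Columns may be separated by a varying number of spaces.
    parts = row.split()
    if len(parts) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} values, got {len(parts)}: {row!r}")
    fields = dict(zip(LABELS, parts))
    # Splitting rule for the date: separate year, month and day on "-".
    year, month, day = fields["date"].split("-")
    # Combining rule (assumed): rejoin the three parts with ".".
    fields["date"] = ".".join((year, month, day))
    return fields


print(split_row("2023-08-16   12000 garment"))
# {'date': '2023.08.16', 'amount': '12000', 'category': 'garment'}
```

In the method described here the labels and rules are learned or configured per result field; the fixed rules above only show what the output of one row looks like.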
After training and optimizing the field data splitting model by using the training data set and the verification data set, the splitting result of the field data splitting optimization model can be evaluated by using the test data set to determine whether the field data splitting optimization model meets the application standard.
Step S104, splitting the field to be split in the file to be split by using the field data splitting optimization model.
In the step, after determining that the field data splitting optimization model accords with the application standard, splitting the field to be split in the file to be split by using the field data splitting optimization model to obtain a field data splitting result.
In an alternative embodiment, the multiple candidate dividing modes include a random dividing mode, a custom dividing mode and a time sequence dividing mode, and step S101 includes: steps a1 to a3.
Step a1, if the historical service data has the time sensitivity feature, selecting the time sequence division mode as the target division mode.
Here, the time sensitivity characteristic is that the history service data includes a service field indicating time, for example: when the date field or the time field is included in the history service data, it may be determined that the history service data has a time-sensitive characteristic.
When the historical service data has the time sensitivity feature, the time sequence division mode is selected as the target division mode. Here, the date field or the time field may be selected as the time column, and the data is divided according to the chronological order of the time column. Specifically, two proportion values may be set, a first proportion value for determining the training data set and a second proportion value for determining the verification data set. For example, with a first proportion value of 30% and a second proportion value of 40%, the data is sorted in chronological order, the first 30% of the data is used as the training data set, the data from 30% to 70% is used as the verification data set, and the remaining 30% is used as the test data set.
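The 30% / 40% / 30% example can be sketched as follows. The function name `time_sequence_division`, the representation of rows as dictionaries and the use of integer percentages are assumptions for illustration; only the ordering by time column and the two proportion values come from the text.

```python
from datetime import date, timedelta


def time_sequence_division(rows, time_column, first_pct=30, second_pct=40):
    """Divide rows chronologically into training, verification and test sets.

    first_pct is the share used for training and second_pct the share used
    for verification; the remainder becomes the test set.
    """
    # Sort by the time column so that earlier data comes first.
    ordered = sorted(rows, key=lambda row: row[time_column])
    train_end = len(ordered) * first_pct // 100
    valid_end = len(ordered) * (first_pct + second_pct) // 100
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Ten rows in arbitrary order, one per day starting on 2023-08-01.
rows = [
    {"date": date(2023, 8, 1) + timedelta(days=d), "amount": 1000 + d}
    for d in (7, 2, 9, 0, 4, 1, 8, 3, 6, 5)
]
train, valid, test = time_sequence_division(rows, "date")
print(len(train), len(valid), len(test))  # 3 4 3
```

Integer percentages are used instead of floating-point ratios so that the cut points do not depend on rounding.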
Step a2, if the historical service data has the format-fixed feature, selecting the custom division mode as the target division mode.
Here, the format-fixed feature is that the history service data is configured in a fixed format, for example: the different sets of data are partitioned by using spaces or commas, or are combined together according to a specific rule.
When the historical service data has the format-fixed feature, the custom division mode is selected as the target division mode. The custom division mode divides data according to a specified symbol or format. For example, a date field may be divided on the specified "-" symbol.
It follows that when the data in the field to be split contains both a date and other data, and each must end up in its own result field, the time sequence division mode can be applied first and the custom division mode second, as a combined division.
Step a3, if the historical service data has neither the time sensitivity feature nor the format-fixed feature, selecting the random division mode as the target division mode.
If the historical service data has neither time sensitivity nor format fixing characteristics, a random division mode can be selected as a target division mode, and the influence of the format and the time characteristic is not needed to be considered. The random division mode is suitable for being used in scenes with relatively single data types and low requirements on the precision of the splitting result.
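Steps a1 to a3 amount to a small decision rule, sketched below. The enum `DivisionMode`, the function `select_target_modes` and the two boolean inputs are hypothetical names; how the features themselves are detected is not shown. The sketch returns a list because the description allows more than one target division mode, with the time sequence mode applied before the custom mode.

```python
from enum import Enum
from typing import List


class DivisionMode(Enum):
    TIME_SEQUENCE = "time_sequence"
    CUSTOM = "custom"
    RANDOM = "random"


def select_target_modes(time_sensitive: bool, format_fixed: bool) -> List[DivisionMode]:
    """Choose target division modes from the data features (steps a1 to a3)."""
    modes = []
    if time_sensitive:
        # Step a1: data contains a service field indicating time.
        modes.append(DivisionMode.TIME_SEQUENCE)
    if format_fixed:
        # Step a2: data is configured according to a fixed format.
        modes.append(DivisionMode.CUSTOM)
    if not modes:
        # Step a3: neither feature is present.
        modes.append(DivisionMode.RANDOM)
    return modes


print(select_target_modes(time_sensitive=True, format_fixed=True))
print(select_target_modes(time_sensitive=False, format_fixed=False))
```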
In an alternative embodiment, the method further comprises: if the historical service data has the characteristics of unbalanced classification or relevance, a random division mode is not selected as a target division mode.
Specifically, the classification imbalance is characterized by that the ratio of the data amount between two types of data in the historical service data is greater than a ratio threshold, for example: the data amount of the class a store data is 1000 lines, and the data amount of the class B store data is only 5 lines, and since 1000/5=200, 200 is greater than the ratio threshold 10, the data is provided with the classification imbalance feature.
The relevance feature means that the data of different rows in the historical service data are related to one another. For example, the data of the first row depends on the data of the second row, so the two rows have a dependency relationship; or the first ten rows are all class A store data and the next ten rows are all class B store data, so the first ten rows are related to one another and the next ten rows are likewise related to one another.
When the historical service data has the characteristic of unbalanced classification, if the data set is divided according to a random division mode, the data set is possibly unbalanced in division, and the model training is inaccurate. When the historical service data has the relevance characteristic, if the data set is divided according to a random division mode, different types of data can be divided together, and the problem of inaccurate model training can be caused.
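The classification imbalance check from the store example can be sketched as below. The function names and the choice to compare the largest class with the smallest class are assumptions; the text only states that the ratio between two classes exceeds a ratio threshold, with 10 as the example threshold.

```python
from collections import Counter


def has_classification_imbalance(class_labels, ratio_threshold=10):
    """Return True if the largest class outnumbers the smallest by more than the threshold."""
    counts = Counter(class_labels)
    if len(counts) < 2:
        return False  # a single class cannot be imbalanced against another
    return max(counts.values()) / min(counts.values()) > ratio_threshold


def random_division_allowed(imbalanced: bool, has_relevance: bool) -> bool:
    """Random division is ruled out when either feature is present."""
    return not (imbalanced or has_relevance)


# 1000 rows of class A store data and 5 rows of class B store data: 1000 / 5 = 200 > 10.
labels = ["A"] * 1000 + ["B"] * 5
imbalanced = has_classification_imbalance(labels)
print(imbalanced, random_division_allowed(imbalanced, has_relevance=False))  # True False
```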
In an optional embodiment, the dividing parameter includes a random proportion corresponding to a random dividing manner and a line identifier corresponding to a custom dividing manner, and step S102 includes: steps b1 to b3.
Step b1, if the target division mode is the random division mode, randomly selecting a number of rows corresponding to the random proportion from the historical service data as target rows, and taking the service data corresponding to the target rows and the target service field as the training data set.
Assuming that the historical service data has 10000 rows in total and the random proportion=10%, 1000 rows are randomly selected from the historical service data as target rows, and the service data corresponding to the 1000 rows and the target service field are used as a training data set. The target service field is a field in the historical service data, and data which corresponds to the target service field and the target row together is taken as sample data.
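The random selection in step b1 can be sketched as follows, under the assumption that the historical service data is held as a list of dictionaries keyed by field name. The function name `random_training_set`, the field name `address` and the optional `seed` parameter are hypothetical and are used only for illustration.

```python
import random


def random_training_set(rows, target_field, random_proportion, seed=None):
    """Randomly pick a proportion of rows and return the target-field
    values of those rows as the training data set."""
    rng = random.Random(seed)
    sample_size = round(len(rows) * random_proportion)
    # Sample row indices without replacement so no row is selected twice.
    target_rows = rng.sample(range(len(rows)), sample_size)
    return [rows[i][target_field] for i in target_rows]


# 10000 rows of historical service data, random proportion = 10%.
rows = [{"address": f"addr-{i}", "amount": i} for i in range(10000)]
training_set = random_training_set(rows, "address", 0.10, seed=42)
print(len(training_set))  # 10% of 10000 rows, so this prints 1000
```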
Step b2, if the target division mode is the custom division mode, selecting the rows corresponding to the row identifier from the historical service data as target rows, and obtaining a training data set, a test data set and a verification data set from the service data corresponding to the target rows and the target service field.
If the target division mode is the custom division mode, a row identifier needs to be set. For example, assuming the set row identifier is 10 to 100, rows 10 to 100 are selected from the historical service data as target rows, and the training data set, the test data set and the verification data set are obtained from the service data corresponding to rows 10 to 100 and the target service field.
Step b3, if the target division mode is the time sequence division mode, selecting service data in different preset ranges from the historical service data in time order as the training data set, the verification data set and the test data set, respectively.
Here, the different preset ranges are the ranges indicated by the first ratio value and the second ratio value. Continuing the above example, if the target division mode is the time sequence division mode, the first 30% of the service data in time order is selected from the historical service data as the training data set, the service data from 30% to 70% is used as the verification data set, and the remaining data is used as the test data set.
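A minimal sketch of step b3, assuming the rows are dictionaries and the time field already holds year-month-day strings, which sort correctly as plain text. The function name and the parameter names `first_ratio` and `second_ratio` are hypothetical stand-ins for the first and second ratio values.

```python
def time_sequence_split(rows, time_field, first_ratio=0.3, second_ratio=0.7):
    """Sort rows by time and cut them into training, verification and
    test data sets at the two ratio boundaries."""
    ordered = sorted(rows, key=lambda row: row[time_field])
    first_cut = round(len(ordered) * first_ratio)
    second_cut = round(len(ordered) * second_ratio)
    training = ordered[:first_cut]
    verification = ordered[first_cut:second_cut]
    testing = ordered[second_cut:]
    return training, verification, testing


# Ten rows with unordered dates; the split follows time order, not row order.
rows = [{"date": f"2023-01-{day:02d}"} for day in (5, 1, 9, 3, 7, 2, 10, 4, 8, 6)]
training, verification, testing = time_sequence_split(rows, "date")
print(len(training), len(verification), len(testing))
```

Because the cut points follow time order, every row in the training data set is no later than any row in the verification data set, and likewise for the verification and test data sets.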
In an optional embodiment, the division parameters further include a time column identifier corresponding to the time sequence division mode, and before step S103, the method further includes: if the target division mode is the time sequence division mode, determining whether the service data corresponding to the time column identifier contains data that does not meet the format requirement; and if such data exists, converting it to obtain data that meets the format requirement.
Specifically, if the target division mode is the time sequence division mode, the division parameters include a time column identifier, which indicates the time column. If the time column identifier is date, it is determined whether the service data in the date field contains data that does not meet the format requirement. The format requirement is determined in advance; for example, the format of the date field must be year-month-day, and any value in another format is determined not to meet the requirement. For example, if some data in the date field contains not only the year, month and day but also the time of day, that data does not meet the format requirement and is converted into data containing only the year, month and day, thereby obtaining data that meets the format requirement.
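The format check and conversion can be sketched as follows. The concrete patterns `%Y-%m-%d` and `%Y-%m-%d %H:%M:%S` are assumptions; the description only requires year-month-day and mentions values that additionally carry a time of day. The function name `normalize_date` is hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"               # required format: year-month-day
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of values with a time


def normalize_date(value):
    """Return value in year-month-day form, dropping any time of day."""
    for fmt in (DATE_FORMAT, DATETIME_FORMAT):
        try:
            return datetime.strptime(value, fmt).strftime(DATE_FORMAT)
        except ValueError:
            # The value does not match this pattern; try the next one.
            continue
    raise ValueError(f"unsupported date value: {value!r}")


print(normalize_date("2023-09-08 14:30:00"))  # prints 2023-09-08
print(normalize_date("2023-09-08"))           # prints 2023-09-08
```

Values that match neither pattern raise an error in this sketch, since the description does not say how other formats should be handled.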
In an alternative embodiment, the method further comprises: determining the number of non-repeated values in the service data corresponding to the time column identifier; and if the number of non-repeated values is less than a set number threshold, the time sequence division mode cannot be used as the target division mode.
To ensure that the date field has enough non-repeated values, so that the verification data set and the test data set are not empty, the number of non-repeated values needs to be counted. Continuing the above example, the number of non-repeated values among all the service data in the date field is counted; if that number is smaller than the set number threshold of 20, the time sequence division mode cannot be used as the target division mode.
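This check can be sketched as follows, interpreting "non-repeated values" as distinct values, which is an assumption about the intended meaning. The function name is hypothetical; the threshold of 20 comes from the example above.

```python
def can_use_time_sequence_division(time_values, count_threshold=20):
    """Return False when the time column has fewer distinct values
    than the set number threshold."""
    distinct_count = len(set(time_values))
    return distinct_count >= count_threshold


# 1000 rows that share only 5 distinct dates: time sequence division is ruled out.
few_dates = [f"2023-01-{(i % 5) + 1:02d}" for i in range(1000)]
print(can_use_time_sequence_division(few_dates))  # 5 < 20, so this prints False
```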
In an alternative embodiment, the row identifier includes a training data row identifier, a test data row identifier, and a verification data row identifier, and step b2 includes: steps b21 to b23.
Here, the row identifier may be further refined into a training data row identifier, a test data row identifier and a verification data row identifier. When setting the value of the division parameter corresponding to the target division mode, if the target division mode is the custom division mode, the training data row identifier may be set to rows 1 to 1000, the test data row identifier to rows 1001 to 2000, and the verification data row identifier to rows 2001 to 3000. The training data row identifier indicates the range of the training data set, the test data row identifier indicates the range of the test data set, and the verification data row identifier indicates the range of the verification data set.
Step b21, selecting the rows corresponding to the training data row identifier from the historical service data as target training rows, and taking the service data corresponding to the target training rows and the target service field as the training data set.
Continuing the above example, rows 1 to 1000 are taken as the target training rows, and the service data corresponding to rows 1 to 1000 and the target service field is taken as the training data set.
Step b22, selecting a row corresponding to the test data row identification from the historical service data as a target test row, and taking the service data corresponding to the target test row and the target service field as a test data set.
Continuing the above example, rows 1001 to 2000 are taken as the target test rows, and the service data corresponding to rows 1001 to 2000 and the target service field is taken as the test data set.
Step b23, selecting a row corresponding to the verification data row identification from the historical service data as a target verification row, and taking the service data corresponding to the target verification row and the target service field as a verification data set.
Continuing the above example, rows 2001 to 3000 are taken as the target verification rows, and the service data corresponding to rows 2001 to 3000 and the target service field is taken as the verification data set.
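Steps b21 to b23 can be sketched together as follows, assuming each row identifier is given as a 1-based, inclusive (first row, last row) pair and the rows are dictionaries. The function names and the field name `address` are hypothetical.

```python
def select_rows(rows, target_field, row_range):
    """Return the target-field values of rows first_row..last_row
    (1-based, inclusive)."""
    first_row, last_row = row_range
    return [row[target_field] for row in rows[first_row - 1:last_row]]


def custom_split(rows, target_field, training_rows, test_rows, verification_rows):
    """Build the three data sets from three explicitly configured row ranges."""
    training_set = select_rows(rows, target_field, training_rows)
    test_set = select_rows(rows, target_field, test_rows)
    verification_set = select_rows(rows, target_field, verification_rows)
    return training_set, test_set, verification_set


# Row identifiers from the example: 1-1000, 1001-2000 and 2001-3000.
rows = [{"address": f"addr-{i}"} for i in range(1, 3001)]
training_set, test_set, verification_set = custom_split(
    rows, "address", (1, 1000), (1001, 2000), (2001, 3000)
)
print(len(training_set), len(test_set), len(verification_set))
```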
Compared with field data splitting methods in the prior art, the present method determines the target division mode according to the data features of the historical service data. Because the target division mode accurately reflects those data features, data sets are divided more accurately for historical service data with different data features. Moreover, because the data features of the selected historical service data are the same as those of the field to be split, the data sets obtained with the target division mode also accurately reflect the data features of the data in the field to be split. When the field data splitting model is trained with such data sets, the resulting field data splitting optimization model can split field data more accurately. In addition, the target division mode may be a combination of several division modes, so that the historical service data can be divided along multiple dimensions. This further improves the accuracy of data set division and addresses the problem of low splitting accuracy when the data to be split has complex data types and a large data volume.
Based on the same inventive concept, an embodiment of the present application further provides a field data splitting device corresponding to the field data splitting method. Because the device solves the problem on a principle similar to that of the field data splitting method in the embodiments of the present application, the implementation of the device may refer to the implementation of the method, and repeated description is omitted.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a field data splitting apparatus according to an embodiment of the application. As shown in fig. 2, the field data splitting apparatus 200 includes:
a mode selection module 201, configured to select a target division mode from multiple candidate data set division modes according to data features of the historical service data, and set a value of a division parameter corresponding to the target division mode;
the data dividing module 202 is configured to divide the historical service data according to the target division manner and the set division parameter value, and obtain a training data set, a test data set and a verification data set;
the model training module 203 is configured to train and optimize the field data splitting model by using a training data set, a test data set and a verification data set, so as to obtain a field data splitting optimization model;
a data splitting module 204, configured to split the field to be split in the file to be split by using the field data splitting optimization model.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the application. As shown in fig. 3, the electronic device 300 includes a processor 310, a memory 320, and a bus 330.
The memory 320 stores machine-readable instructions executable by the processor 310. When the electronic device 300 is running, the processor 310 communicates with the memory 320 through the bus 330, and when the machine-readable instructions are executed by the processor 310, the steps of the field data splitting method in the method embodiment shown in fig. 1 are performed. For the specific implementation, refer to the method embodiment; details are not repeated here.
The embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program when executed by a processor may perform the steps of the field data splitting method in the method embodiment shown in fig. 1, and the specific implementation manner may refer to the method embodiment and will not be described herein.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific implementations of the present application and are not intended to limit its protection scope. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for splitting field data, comprising:
selecting a target division mode from a plurality of candidate division modes according to the data characteristics of the historical service data, and setting the value of a division parameter corresponding to the target division mode, wherein the data characteristics of the historical service data are the same as the data characteristics of a file to be split;
dividing the historical service data according to the target division mode and the set division parameter value to obtain a training data set, a test data set and a verification data set;
training and optimizing a field data splitting model by using the training data set, the test data set and the verification data set to obtain a field data splitting optimization model;
and splitting the field to be split in the file to be split by using the field data splitting optimization model.
2. The method of claim 1, wherein the plurality of candidate division modes include a random division mode, a custom division mode and a time sequence division mode, and wherein the selecting the target division mode from the plurality of candidate division modes according to the data characteristics of the historical service data comprises:
if the historical service data has time sensitivity characteristics, selecting a time sequence division mode as a target division mode, wherein the time sensitivity characteristics are that the historical service data comprises service fields indicating time;
if the historical service data has a format fixed characteristic, selecting a self-defined dividing mode as a target dividing mode, wherein the format fixed characteristic is that the historical service data is configured according to a fixed format;
and if the historical service data does not have the time sensitivity characteristic and the format fixing characteristic, selecting a random division mode as a target division mode.
3. The method according to claim 2, wherein the method further comprises:
if the historical service data has the characteristics of unbalanced classification or relevance, the random division mode is not selected as a target division mode;
the classification imbalance characteristic is that the ratio of the data quantity between two types of data in the historical service data is larger than a ratio threshold, and the relevance characteristic is that the data of different rows in the historical service data have relevance relations.
4. The method according to claim 2, wherein the division parameters include a random proportion corresponding to the random division mode and a row identifier corresponding to the custom division mode, and the dividing the historical service data according to the target division mode and the set division parameter value to obtain a training data set, a test data set and a verification data set comprises:
if the target division mode is a random division mode, randomly selecting a number of rows corresponding to the random proportion from the historical service data as target rows, and taking the service data corresponding to the target rows and the target service fields as a training data set;
if the target division mode is a custom division mode, selecting a row corresponding to the row identification from the historical service data as a target row, and acquiring a training data set, a test data set and a verification data set from the service data corresponding to the target row and the target service field;
and if the target division mode is a time sequence division mode, selecting service data in different preset ranges from the historical service data according to time sequence as a training data set, a verification data set and a test data set respectively.
5. The method according to claim 2, wherein the division parameters further include a time column identifier corresponding to the time sequence division mode, and before the training and optimizing the field data splitting model by using the training data set, the test data set and the verification data set, the method further includes:
if the target division mode is a time sequence division mode, determining whether data which does not meet the format requirement exists in the service data corresponding to the time column identifier;
if the data which does not meet the format requirement exists, the data is converted to obtain the data which meets the format requirement.
6. The method of claim 5, wherein the method further comprises:
determining the number of non-repeated values in the service data corresponding to the time column identifier;
if the number of non-repeated values is less than a set number threshold, the time sequence division mode cannot be used as the target division mode.
7. The method of claim 4, wherein the row identifier comprises a training data row identifier, a test data row identifier, and a verification data row identifier, and wherein the obtaining the training data set, the test data set, and the verification data set from the service data corresponding to the target row and the target service field comprises:
selecting a row corresponding to the training data row identification from the historical service data as a target training row, and taking the service data corresponding to the target training row and the target service field as a training data set;
selecting a row corresponding to the test data row identifier from the historical service data as a target test row, and taking the service data corresponding to the target test row and the target service field as a test data set;
selecting a row corresponding to the verification data row identification from the historical service data as a target verification row, and taking the service data corresponding to the target verification row and the target service field as a verification data set.
8. A field data splitting apparatus, comprising:
a mode selection module, configured to select a target division mode from a plurality of candidate division modes according to the data characteristics of historical service data, and to set the value of a division parameter corresponding to the target division mode, wherein the data characteristics of the historical service data are the same as the data characteristics of a file to be split;
the data dividing module is used for dividing the historical service data according to the target dividing mode and the set dividing parameter value to obtain a training data set, a test data set and a verification data set;
the model training module is used for training and optimizing the field data splitting model by utilizing the training data set, the test data set and the verification data set to obtain a field data splitting optimization model;
and the data splitting module is used for splitting the field to be split in the file to be split by utilizing the field data splitting optimization model.
9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the field data splitting method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the field data splitting method of any of claims 1 to 7.
CN202311162305.8A 2023-09-08 Method and device for splitting field data, electronic equipment and storage medium Active CN117113929B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311162305.8A CN117113929B (en) 2023-09-08 Method and device for splitting field data, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117113929A true CN117113929A (en) 2023-11-24
CN117113929B CN117113929B (en) 2024-06-21

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355391A (en) * 2015-07-16 2017-01-25 阿里巴巴集团控股有限公司 Service processing method and device
WO2017071369A1 (en) * 2015-10-31 2017-05-04 华为技术有限公司 Method and device for predicting user unsubscription
CN109255480A (en) * 2018-08-30 2019-01-22 中国平安人寿保险股份有限公司 Between servant lead prediction technique, device, computer equipment and storage medium
CN109388561A (en) * 2018-09-18 2019-02-26 深圳壹账通智能科技有限公司 Interface testing case generation method, device, computer equipment and storage medium
WO2020220220A1 (en) * 2019-04-29 2020-11-05 西门子(中国)有限公司 Classification model training method and device, and computer-readable medium
CN112200667A (en) * 2020-11-30 2021-01-08 上海冰鉴信息科技有限公司 Data processing method and device and computer equipment
CN113158650A (en) * 2021-05-12 2021-07-23 中国建设银行股份有限公司 Message processing method, device, equipment and storage medium
CN113657545A (en) * 2021-08-30 2021-11-16 平安医疗健康管理股份有限公司 Method, device and equipment for processing user service data and storage medium
CN113961473A (en) * 2021-11-15 2022-01-21 平安银行股份有限公司 Data testing method and device, electronic equipment and computer readable storage medium
CN114612251A (en) * 2022-03-14 2022-06-10 平安科技(深圳)有限公司 Risk assessment method, device, equipment and storage medium
CN115775060A (en) * 2022-03-24 2023-03-10 广东维正科技有限公司 Real estate stock data sorting method and application thereof
CN116092648A (en) * 2022-11-25 2023-05-09 泰康保险集团股份有限公司 Service processing method, device, electronic equipment and computer readable medium
CN116150663A (en) * 2021-11-22 2023-05-23 腾讯科技(深圳)有限公司 Data classification method, device, computer equipment and storage medium
CN116402596A (en) * 2023-02-15 2023-07-07 恒生电子股份有限公司 Data analysis method, device, computer equipment and readable storage medium
CN116579729A (en) * 2023-03-17 2023-08-11 中电金信数字科技集团有限公司 Service data processing method and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHANSON: "模型评估(一):数据分割方法", Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/362728791> *
SAAD AHMED QURESHI 等: "Telecommunication subscribers' churn prediction model using machine learning", EIGHTH INTERNATIONAL CONFERENCE ON DIGITAL INFORMATION MANAGEMENT (ICDIM 2013), 2 January 2014 (2014-01-02), pages 131 - 136 *
施振兴: "推荐***综合仿真平台评估框架的研究与实现", 中国优秀硕士学位论文全文数据库 信息科技辑, 15 March 2016 (2016-03-15), pages 138 - 7998 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant