CN112949763A - Data extraction method, device, equipment and storage medium

Info

Publication number: CN112949763A
Application number: CN202110357011.5A
Authority: CN
Inventors: 张俊帆
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2021-04-01
Filing date: 2021-04-01
Publication date: 2021-06-11

Abstract

The application relates to a data extraction method, apparatus, device, and storage medium. The method comprises: determining a processing parameter of each feature to be extracted in a data source; configuring a task template by using a storage address of the data source and the processing parameter to obtain a target processing task; and executing the target processing task to obtain sample data corresponding to each feature to be extracted. In addition, the scheme unifies the offline and real-time data for both the feature data of each feature to be extracted and the training data, so that the sample data of each feature to be extracted can be obtained with a single configuration, which is convenient and fast.

Description

Data extraction method, device, equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data extraction method, apparatus, device, and storage medium.

Background

With the development of technology, deep learning models are widely applied in various service scenarios. Using a deep learning model not only simplifies the service processing flow but also improves the user experience. Taking an advertisement recommendation scenario as an example, a model trained on training sample data can recommend advertisements suitable for a user based on that user's data.

Currently, to obtain training sample data, a user of a deep learning model writes a data extraction task that extracts the sample data from source data. However, because the user has to write a new data extraction task each time, obtaining sample data is inefficient.

Disclosure of Invention

The application provides a data extraction method, apparatus, device, and storage medium for improving the efficiency of obtaining sample data, as follows:

In a first aspect, a data extraction method is provided, including:

determining a processing parameter of each feature to be extracted in a data source, wherein the processing parameter is used for indicating data processing logic matched with the feature to be extracted;

configuring a task template by using the storage address of the data source and the processing parameter to obtain a target processing task;

and executing the target processing task to obtain sample data corresponding to each feature to be extracted, wherein the sample data is used for training a target model.

In a second aspect, a data extraction apparatus is provided, including:

a determining unit, configured to determine a processing parameter of each feature to be extracted in a data source, wherein the processing parameter is used for indicating data processing logic matched with the feature to be extracted;

a configuration unit, configured to configure a task template by using the storage address of the data source and the processing parameter to obtain a target processing task;

and an execution unit, configured to execute the target processing task to obtain sample data corresponding to each feature to be extracted, wherein the sample data is used for training a target model.

In a third aspect, an electronic device is provided, including a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the data extraction method according to the first aspect.

In a fourth aspect, a computer-readable storage medium is provided, which stores a computer program, and the computer program implements the data extraction method of the first aspect when executed by a processor.

Compared with the prior art, the technical scheme provided by the embodiments of the application has the following advantages:

According to the technical scheme provided by the embodiments of the application, when sample data corresponding to each feature to be extracted needs to be obtained, the task template is configured by using the storage address of the data source and the processing parameter to obtain the target processing task, and the sample data is obtained based on the target processing task. In addition, the scheme unifies the offline and real-time data for both the feature data of each feature to be extracted and the training data, so that the sample data of each feature to be extracted can be obtained directly with a single configuration. This is convenient and fast, greatly speeds up the production iteration of feature data, and reduces maintenance cost.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a schematic flow chart illustrating a data extraction method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of another data extraction method in the embodiment of the present application;

FIG. 3 is a schematic flow chart of another data extraction method in the embodiment of the present application;

FIG. 4 is a schematic structural diagram of a data extraction device in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

The embodiment of the application provides a data extraction method, which can be applied to any electronic device.

The electronic device described in the embodiment of the present application may include a smartphone (e.g., an Android phone, an iOS phone, a Windows Phone), a tablet computer, a palmtop computer, a notebook computer, a video matrix, a monitoring platform, a Mobile Internet Device (MID), or a wearable device. These are merely examples and are not exhaustive; the electronic device includes but is not limited to the foregoing devices, and may of course also be a server.

As shown in fig. 1, the method may include the steps of:

Step 101, determining processing parameters of each feature to be extracted in a data source.

The processing parameter is used to indicate the data processing logic that matches the feature to be extracted. Processing the data source according to this data processing logic yields the sample data corresponding to each feature to be extracted. Specifically, the data source contains source data expressing each feature to be extracted, and the data processing logic indicated by the processing parameters may extract that source data from the data source and use the extracted source data as the feature expression of each feature to be extracted.

In this embodiment, the features to be extracted differ between service scenarios, and each feature to be extracted can be customized by the user. Optionally, each feature to be extracted may be indicated by the user or may be preset. When features to be extracted indicated by the user exist, the feature expressions of those features are extracted preferentially; when no such features exist, the electronic device extracts the feature expressions of the preset features to be extracted by default.

When the user indicates each feature to be extracted, in actual application, the configuration interface of the electronic device displays an input box, and after the user inputs each feature to be extracted in the input box, the user indicates the completion of input through an input device (such as single-click or double-click operation of a mouse), so that the electronic device obtains each feature to be extracted according to the indication.

In this embodiment, because different service scenarios correspond to different data sources, the data sources and the features to be extracted of different service scenarios cannot be cross-adapted. For example, data source a of service scenario A includes source data expressing each feature a to be extracted, and data source b of service scenario B includes source data expressing each feature b to be extracted. Data source a cannot be adapted to the features b to be extracted, because data source a does not include source data expressing them.

In this embodiment, the format type of the data source includes, but is not limited to, at least one of the Hive type or the HDFS type.

Step 102, configuring a task template by using the storage address of the data source and the processing parameter to obtain a target processing task.

Optionally, the storage address of the data source includes, but is not limited to, a URL (uniform resource locator) or a file address.

In this embodiment, both the storage address of the data source and the processing parameter may be indicated by the user. When the user indicates the storage address of the data source, the electronic device either acquires the storage address input by the user and uses it as the storage address of the data source, or displays a plurality of candidate storage addresses to the user based on a user operation, acquires the target storage address selected by the user from them, and uses the target storage address as the storage address of the data source in step 102. Similarly, when the user indicates the processing parameter, the electronic device either acquires the processing parameter input by the user, or displays a plurality of processing parameters to the user based on a user operation, acquires the target processing parameter selected by the user from them, and uses the target processing parameter as the processing parameter in step 102.

In this embodiment, the task template includes data to be configured and template data.

The template data comprises a version number of the task template, a search mode for a data source, and the like. When the task template is configured, the template data is not required to be changed; the data to be configured at least comprises a feature aggregation item and a data source configuration item, wherein the data source configuration item is used for configuring the storage address of the data source, and the feature aggregation item is used for configuring the processing parameters.

Optionally, the search manner for the data source includes, but is not limited to, a query manner based on an SQL statement. In order to realize the configuration of the format type of the data source by the user, the data source configuration items further comprise a type configuration item and a parameter configuration item, the type configuration item is used for configuring the format type of the data source, and the parameter configuration item is used for configuring the storage address of the data source.

It should be noted that the format type of the data source in the type configuration item may be configured according to an instruction of a user, and of course, a default format type may also be adopted, which is not specifically limited in this embodiment.

In order to facilitate reading of sample data corresponding to each feature to be extracted subsequently, the template data in the task template may further include a storage configuration item of the sample data, and the storage configuration item is used for configuring a storage address of the sample data.

Alternatively, the storage address of the sample data may be indicated by the user or a default address may be preset. Specifically, when the storage address of the sample data indicated by the user exists, configuring the storage configuration item by using the storage address of the sample data indicated by the user; and when the storage address of the sample data indicated by the user does not exist, configuring the storage configuration item by adopting a default storage address.

In one example, the default storage address may be a cloud storage address or an address in memory that is automatically allocated by the electronic device.

In order to give the code of the task template a simple and clear hierarchical structure, in this embodiment the task template is written in advance in JSON (JavaScript Object Notation).
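
The template structure described above can be illustrated with a minimal sketch, written here in Python around a JSON string. Every key name (`version`, `search_mode`, `data_source`, `feature_aggregation`, `storage`), the example addresses, and the helper `configure_task` are assumptions made for illustration only; the source does not specify the actual layout of the template.

```python
import copy
import json

# Hypothetical task template. The key names are illustrative assumptions.
# "version" and "search_mode" stand for template data that is not changed;
# the null fields stand for the data to be configured.
TASK_TEMPLATE = json.loads("""
{
  "version": "1.0",
  "search_mode": "sql",
  "data_source": {"type": null, "params": {"address": null}},
  "feature_aggregation": {"operators": {}, "aggregation_key": null},
  "storage": {"address": null}
}
""")

# Assumed default address used when the user indicates none.
DEFAULT_STORAGE_ADDRESS = "hdfs://example/default/samples"


def configure_task(template, source_address, operators, aggregation_key,
                   source_type="hive", storage_address=None):
    """Fill the items to be configured and return the target processing task."""
    task = copy.deepcopy(template)  # the template itself stays unchanged
    task["data_source"]["type"] = source_type
    task["data_source"]["params"]["address"] = source_address
    task["feature_aggregation"]["operators"] = dict(operators)
    task["feature_aggregation"]["aggregation_key"] = aggregation_key
    # Prefer the user-indicated storage address, else fall back to the default.
    task["storage"]["address"] = storage_address or DEFAULT_STORAGE_ADDRESS
    return task


task = configure_task(
    TASK_TEMPLATE,
    source_address="hdfs://example/logs/viewing",
    operators={"life_stage": "life_stage_op", "gender": "copy_gender_op"},
    aggregation_key="user_id",
)
print(json.dumps(task, indent=2))
```

Copying the template before filling it keeps the pre-written template reusable, so a new target processing task only requires a new call with different arguments.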

Step 103, executing the target processing task to obtain sample data corresponding to each feature to be extracted, wherein the sample data is used for training the target model.

When the target model is used in an advertisement recommendation scenario, the input of the target model is the feature data of the user to be predicted, and the output of the target model is the number of advertisement insertions or the videos suitable for the user to be predicted.

The feature data of the user is part or all of the user portrait of the user.

When the target model is used in a commodity recommendation scenario, the input of the target model is the age of the user, and the output of the target model is a commodity suitable for the user.

When the electronic equipment executes the target processing task, the template code corresponding to the target processing task is operated, so that the template code calls the data processing logic indicated by the processing parameters to perform data processing, and sample data is obtained.
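
As a rough illustration of this execution step, the sketch below shows generic template code that looks up the data processing logic named by the processing parameters and applies it to each source row. The function `run_task`, the registry of operators, the task layout, and the sample rows are hypothetical; the source does not disclose the actual template code.

```python
def run_task(task, rows, operator_registry):
    """Run the configured operators over the source rows, one sample per key."""
    key = task["aggregation_key"]
    samples = {}
    for row in rows:
        # Rows sharing the same key value contribute to the same sample;
        # a later row overwrites the features written by an earlier one.
        sample = samples.setdefault(row[key], {key: row[key]})
        for feature, operator_name in task["operators"].items():
            operator = operator_registry[operator_name]  # logic named by the parameter
            sample[feature] = operator(row)
    return list(samples.values())


# Hypothetical operators: each takes one source row and returns a feature value.
operator_registry = {
    "copy_age": lambda row: row["age"],
    "copy_gender": lambda row: row["gender"],
}

# Hypothetical target processing task after configuration.
task = {
    "aggregation_key": "id",
    "operators": {"age": "copy_age", "gender": "copy_gender"},
}

rows = [
    {"id": 1, "age": 15, "gender": "male"},
    {"id": 2, "age": 24, "gender": "female"},
    {"id": 3, "age": 48, "gender": "male"},
]

samples = run_task(task, rows, operator_registry)
print(samples)
```

Because the template code only dispatches on operator names, changing which features are extracted means changing the configuration rather than the code.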

In this embodiment, the target model may be a model suitable for any service scenario, for example, when the service scenario is an advertisement recommendation scenario, the target model is a model for performing advertisement recommendation; when the business scene is a commodity recommendation scene, the target model is a commodity recommendation model and the like.

The present embodiment is described below with respect to a scenario in which an advertisement is recommended and a scenario in which a product is recommended, respectively:

In the advertisement recommendation scenario:

The data source may be a log file storing viewing records of different users, and each feature to be extracted may include at least one of: age, gender, historical viewing records within a preset time period, or the number of advertisement insertions per viewing.

The historical viewing records within the preset time period may include the viewing duration and/or the categories of the videos watched within the preset time period. For example, the historical viewing records within the preset time period may be the duration spent watching movie videos, the duration spent watching art videos, and/or the duration spent watching animation within 7 days.
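
One way such a feature could be computed from viewing-log records is sketched below. The record fields (`user_id`, `category`, `seconds`, `watched_at`), the function name `viewing_history`, and the sample records are assumptions for illustration; the source does not describe the log format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def viewing_history(records, user_id, now, window_days=7):
    """Total watch time in seconds per video category for one user in the window."""
    cutoff = now - timedelta(days=window_days)
    totals = defaultdict(int)
    for rec in records:
        if rec["user_id"] == user_id and cutoff <= rec["watched_at"] <= now:
            totals[rec["category"]] += rec["seconds"]
    return dict(totals)


# Hypothetical viewing-log records.
records = [
    {"user_id": 1, "category": "movie", "seconds": 3600,
     "watched_at": datetime(2021, 3, 30)},
    {"user_id": 1, "category": "animation", "seconds": 1200,
     "watched_at": datetime(2021, 3, 31)},
    # Older than 7 days, so it is excluded from the result.
    {"user_id": 1, "category": "movie", "seconds": 1800,
     "watched_at": datetime(2021, 3, 1)},
    # Belongs to a different user, so it is excluded as well.
    {"user_id": 2, "category": "movie", "seconds": 600,
     "watched_at": datetime(2021, 3, 31)},
]

history = viewing_history(records, user_id=1, now=datetime(2021, 4, 1))
print(history)  # {'movie': 3600, 'animation': 1200}
```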

The number of advertisement insertions refers to the number of mid-roll advertisements inserted while one video is being watched.

The target model is then used to predict the number of advertisement insertions or the videos suitable for the user to be predicted. Specifically, the target model takes the feature data of the user to be predicted as input and outputs the number of advertisement insertions or the videos suitable for that user.

In the commodity recommendation scenario:

The data source may be a log file storing shopping records of different users, and each feature to be extracted includes at least one of: life stage, gender, or commodity category.

Illustratively, life stages may include children, young people, middle-aged people, and elderly people.

The life stage of the user can be determined according to the user portrait of the user.

For example, if young people are defined as 15-30 years old, middle-aged people as 31-50 years old, and elderly people as over 50 years old, the data processing logic matched with the life stage analyzes the age of the user in the user's shopping record to determine the life stage corresponding to the user.

The target model is then used to predict the commodity type suitable for the user to be predicted. Specifically, the target model takes the feature data of the user to be predicted as input and outputs the commodity type suitable for that user.

In another embodiment of the present application, the processing parameter of each feature to be extracted may include a processing operator corresponding to each feature to be extracted and a data aggregation identifier corresponding to each feature to be extracted.

When determining the processing parameters of each feature to be extracted in the data source, as shown in fig. 2, the method may include the following steps:

step 201, acquiring a processing operator corresponding to each feature to be extracted indicated by a user.

Any one of the processing operators is used for indicating the extraction logic of the feature expression of the corresponding feature to be extracted.

Optionally, the processing operator may be an extraction function corresponding to the feature to be extracted, where the extraction function defines an extraction logic of the feature expression of the feature to be extracted.

Taking the life stage as the feature to be extracted, and defining its feature expression as young (18-30 years old), middle-aged (30-50 years old), or old (over 50 years old), the extraction function corresponding to this feature may be:
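
A minimal sketch of what such an extraction function might look like, written in Python for illustration and not taken from the source. The function name `life_stage` is hypothetical. Because the stated ranges share the boundary ages 30 and 50, the sketch assumes that 30 counts as young and 50 as middle-aged, and it assumes that ages below 18, for which no stage is defined, yield `None`.

```python
def life_stage(age):
    """Map an age in years to a life-stage label (boundary handling is assumed)."""
    if age < 18:
        return None  # no life stage is defined below 18 in this example
    if age <= 30:
        return "young"
    if age <= 50:
        return "middle-aged"
    return "old"


stages = [life_stage(age) for age in (17, 18, 30, 31, 50, 51)]
print(stages)  # [None, 'young', 'young', 'middle-aged', 'middle-aged', 'old']
```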

Optionally, in this embodiment, the processing operator indicated by the user may be obtained in either of the following two ways:

First, the electronic device displays a plurality of predefined processing operators to the user based on a user operation, acquires the target processing operator selected by the user from the plurality of predefined processing operators, and uses the target processing operator as the processing operator in step 201.

Second, the electronic device acquires a processing operator written by the user based on a framework for user-defined operators in the electronic device, and uses it as the processing operator in step 201.

Step 202, determining a data aggregation identifier corresponding to each feature to be extracted.

The data aggregation identifier is used for aggregating the feature expression of each feature to be extracted to obtain a plurality of pieces of sample data, and the data aggregation identifiers of any two pieces of sample data are different.

In this embodiment, the feature expressions of the features to be extracted of different users can be distinguished by using the data aggregation identifier, so that a plurality of pieces of sample data are obtained by using the data aggregation identifier.

Optionally, any piece of sample data may be sample data obtained by aggregating feature expressions of features to be extracted corresponding to the same user, and the data aggregation identifier may be identity information of the user, such as an Identifier (ID) of the user; or, any piece of sample data is sample data obtained by aggregating the feature expressions of the features to be extracted corresponding to at least two users, and the data aggregation identifier may be parameters such as gender of the user.

The data aggregation identifier may be indicated by a user, or a default feature may be preset as the data aggregation identifier.

Taking the life stage and gender as the features to be extracted, the user ID as the data aggregation identifier, and the data of the data source as shown in Table 1, the process of obtaining sample data is briefly described below:

Table 1

ID	Age	Gender
1	15	Male
2	24	Female
3	48	Male

The feature expressions of the features to be extracted, obtained through the processing operators corresponding to those features, are shown in Table 2 and Table 3, respectively:

Table 2

ID	Life stage
1	Young people
2	Middle-aged
3	Middle-aged

Table 3

ID	Gender
1	Male
2	Female
3	Male

The sample data obtained through the data aggregation identifier is shown in Table 4:

Table 4

ID	Life stage	Gender
1	Young people	Male
2	Middle-aged	Female
3	Middle-aged	Male
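
The aggregation step illustrated by the tables above can be sketched as follows: the per-feature results are merged on the data aggregation identifier so that each ID yields one piece of sample data. The function name `aggregate` and the dictionary-based row representation are assumptions for illustration; the values are those of the life-stage and gender feature tables above.

```python
def aggregate(feature_tables, key="id"):
    """Merge per-feature tables into one sample per aggregation identifier."""
    samples = {}
    for table in feature_tables:
        for row in table:
            # Rows with the same identifier are merged into the same sample.
            samples.setdefault(row[key], {}).update(row)
    return [samples[k] for k in sorted(samples)]


# Feature expressions per feature, keyed by user ID.
life_stage_table = [
    {"id": 1, "life_stage": "Young people"},
    {"id": 2, "life_stage": "Middle-aged"},
    {"id": 3, "life_stage": "Middle-aged"},
]
gender_table = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": "Female"},
    {"id": 3, "gender": "Male"},
]

samples = aggregate([life_stage_table, gender_table])
for sample in samples:
    print(sample)
```

Using a different key, such as gender, would instead group the feature expressions of several users into one piece of sample data.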

Step 203, determining the processing operators and the data aggregation identifiers as the processing parameters.

Corresponding to this specific implementation of the processing parameters, the feature aggregation item includes an extraction logic configuration item and an aggregation parameter configuration item. The extraction logic configuration item is used for configuring the processing operator of each feature to be extracted, and the aggregation parameter configuration item is used for configuring the data aggregation identifier.

In the embodiment, the processing operators corresponding to the features to be extracted and the data aggregation identifiers corresponding to the features to be extracted are directly obtained, so that convenience of data extraction is improved, and efficiency of data extraction is improved.

In another embodiment of the present application, in order to improve operability for the user, processing operators for the features to be extracted can be programmed on the electronic device. Specifically, the user indicates operator parameters of a processing operator; after obtaining the operator parameters indicated by the user, the electronic device calls a software development kit to generate the processing operator corresponding to the operator parameters, and stores the processing operator.

The operator parameters include, but are not limited to, operator names and description information.

In one example, when the processing operator is a summation operator, the operator name of the processing operator may be "summation (sum)", and the description information is "for summation".
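
The idea of generating and storing an operator from its operator parameters can be sketched with a small registry. The decorator `register_operator`, the `Operator` record, and the `OPERATORS` store are hypothetical stand-ins for the software development kit, whose actual interface is not described in the source.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Operator:
    name: str          # operator name, e.g. "sum"
    description: str   # description information
    func: Callable     # the processing logic itself


# Hypothetical store of generated operators, keyed by operator name.
OPERATORS: Dict[str, Operator] = {}


def register_operator(name, description):
    """Stand-in for the SDK call that generates and stores a processing operator."""
    def decorator(func):
        OPERATORS[name] = Operator(name, description, func)
        return func
    return decorator


@register_operator(name="sum", description="for summation")
def sum_operator(values):
    return sum(values)


total = OPERATORS["sum"].func([1, 2, 3])
print(OPERATORS["sum"].description, total)  # for summation 6
```

Once stored, an operator can be offered to the user as one of the predefined processing operators by listing the keys of the store.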

Based on the same inventive concept, an embodiment of the present application further provides a data extraction apparatus, as shown in fig. 4, including:

a determining unit 401, configured to determine a processing parameter of each feature to be extracted in the data source, where the processing parameter is used to indicate a data processing logic matched with the feature to be extracted;

a configuration unit 402, configured to configure a task template by using a storage address and a processing parameter of a data source to obtain a target processing task;

and the execution unit 403 is configured to execute the target processing task to obtain sample data corresponding to each feature to be extracted, where the sample data is used to train the target model.

Optionally, the determining unit 401 is specifically configured for:

acquiring processing operators corresponding to the features to be extracted indicated by a user, wherein any one processing operator is used for indicating the extraction logic of the feature expression of the corresponding feature to be extracted;

determining a data aggregation identifier corresponding to each feature to be extracted, wherein the data aggregation identifier is used for aggregating the feature expression of each feature to be extracted to obtain a plurality of sample data, and the data aggregation identifiers of any two sample data are different;

and determining the processing operator and the data aggregation identifier as processing parameters.

Optionally, the configuration unit 402 is specifically configured for:

acquiring a data source configuration item and a feature aggregation item from a task template;

and configuring a data source configuration item by using the storage address of the data source, and configuring a characteristic aggregation item by using the processing parameter to obtain the target processing task.

Optionally, the execution unit 403 is specifically configured for:

and operating the template code corresponding to the target processing task, so that the template code calls the data processing logic indicated by the processing parameters to perform data processing, and obtaining sample data.

Optionally, the apparatus is further configured for:

before determining the processing parameters of each feature to be extracted in the data source, acquiring the operator parameters of the processing operator corresponding to each feature to be extracted, wherein the operator parameters comprise operator names and description information;

and generating a processing operator corresponding to the operator parameter by using a software development kit, and storing the processing operator.

Optionally:

the data source comprises a log file, and the log file stores the watching records of the user;

each feature to be extracted includes at least one of: age, gender, historical viewing records within a preset time period or advertisement insertion times during each viewing;

the target model is used to predict the number of commercial breaks or videos suitable for the user to be predicted.

Optionally, the apparatus is further configured for:

before configuring the task template by using the storage address of the data source and the processing parameter to obtain the target processing task, writing the task template according to the JSON expression format.

Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, the electronic device mainly includes: a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 are communicated with each other through the communication bus 504. The memory 503 stores a program executable by the processor 501, and the processor 501 executes the program stored in the memory 503 to implement the data extraction method described in any of the above embodiments.

The communication bus 504 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 504 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.

The communication interface 502 is used for communication between the above-described electronic apparatus and other apparatuses.

The memory 503 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may be at least one storage device located remotely from the aforementioned processor 501.

The processor 501 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present application, there is also provided a computer-readable storage medium having stored therein a computer program which, when run on a computer, causes the computer to perform the data extraction method described in any of the above embodiments.

The above embodiments may be implemented wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state drives), among others.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

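1. A method of data extraction, comprising:

determining a processing parameter of each feature to be extracted in a data source, wherein the processing parameter is used for indicating data processing logic matched with the feature to be extracted;

configuring a task template by using the storage address of the data source and the processing parameter to obtain a target processing task;

and executing the target processing task to obtain sample data corresponding to each feature to be extracted, wherein the sample data is used for training a target model.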

2. The method of claim 1, wherein the determining the processing parameters of each feature to be extracted in the data source comprises:

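acquiring processing operators corresponding to the features to be extracted indicated by a user, wherein any one of the processing operators is used for indicating the extraction logic of the feature expression of the corresponding feature to be extracted;

determining a data aggregation identifier corresponding to each feature to be extracted, wherein the data aggregation identifier is used for aggregating the feature expression of each feature to be extracted to obtain a plurality of pieces of sample data, and the data aggregation identifiers of any two pieces of sample data are different;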

and determining the processing operator and the data aggregation identifier as the processing parameter.

3. The method of claim 1, wherein configuring a task template using the storage address of the data source and the processing parameter to obtain a target processing task comprises:

acquiring a data source configuration item and a feature aggregation item from the task template;

and configuring the data source configuration item by using the storage address of the data source, and configuring the characteristic aggregation item by using the processing parameter to obtain the target processing task.

4. The method of claim 1, wherein the executing the target processing task to obtain sample data corresponding to each feature to be extracted comprises:

and running a template code corresponding to the target processing task, so that the template code calls the data processing logic indicated by the processing parameters to perform data processing, and the sample data is obtained.

5. The method of claim 1, wherein before determining the processing parameters of each feature to be extracted in the data source, the method further comprises:

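acquiring operator parameters of processing operators corresponding to the features to be extracted, wherein the operator parameters comprise operator names and description information;

and generating the processing operator corresponding to the operator parameters by using a software development kit, and storing the processing operator.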

6. The method of claim 1, wherein:

the data source comprises a log file, and the log file stores the watching record of the user;

each feature to be extracted comprises at least one of the following: age, gender, historical viewing records within a preset time period or advertisement insertion times during each viewing;

the target model is used to predict the number of advertisement insertions or the videos suitable for the user to be predicted.

7. The method of claim 1, wherein before the configuring the task template by using the storage address of the data source and the processing parameter to obtain the target processing task, the method further comprises:

and writing the task template according to the JSON expression format.

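8. A data extraction apparatus, comprising:

a determining unit, configured to determine a processing parameter of each feature to be extracted in a data source, wherein the processing parameter is used for indicating data processing logic matched with the feature to be extracted;

a configuration unit, configured to configure a task template by using the storage address of the data source and the processing parameter to obtain a target processing task;

and an execution unit, configured to execute the target processing task to obtain sample data corresponding to each feature to be extracted, wherein the sample data is used for training a target model.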

9. An electronic device, comprising: a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus; the memory is configured to store a computer program; and the processor is configured to execute the program stored in the memory to implement the data extraction method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored which, when being executed by a processor, implements the data extraction method of any one of claims 1 to 7.