WO2016192583A1 - Data processing method and apparatus for a data warehouse - Google Patents

Data processing method and apparatus for a data warehouse

Info

Publication number
WO2016192583A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
dependency
queried
data processing
metadata
Prior art date
Application number
PCT/CN2016/083591
Other languages
English (en)
French (fr)
Inventor
吴勇军
Original Assignee
阿里巴巴集团控股有限公司
Priority date
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2016192583A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Definitions

  • The present application relates to data processing technologies, and in particular to a data processing method and apparatus for a data warehouse.
  • A data warehouse is an environment that provides the current and historical data users rely on for decision support; such data is difficult or impossible to obtain from traditional operational databases.
  • Data warehousing technology is a general term for the technologies and modules that integrate operational data into a unified environment in order to provide access to decision-making data. All of this is done so that data users can query the information they need faster and more conveniently, and to give them decision support.
  • With the data processing method of the prior art, data that is no longer useful keeps occupying computing and storage resources, which wastes resources.
  • The embodiments of the present application provide a data processing method and apparatus for a data warehouse, intended to solve the waste of resources caused in the prior art by useless data occupying resources.
  • According to one aspect of the embodiments, a data processing method for a data warehouse is provided, including: receiving a query condition input by a user, the query condition including a keyword of the data to be queried; determining, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency; returning the dependency to the user; receiving a data processing instruction issued by the user according to the dependency; and triggering the data warehouse to execute the data processing instruction on the data to be queried.
  • According to another aspect of the embodiments, a data processing apparatus for a data warehouse is provided, including: a query module, configured to receive a query condition input by a user, the query condition including a keyword of the data to be queried; a dependency determination module, configured to determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency; a feedback module, configured to return the dependency to the user; an instruction receiving module, configured to receive a data processing instruction issued by the user according to the dependency; and a triggering module, configured to trigger the data warehouse to execute the data processing instruction on the data to be queried.
  • With the data processing method and apparatus of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed and improves the resource utilization efficiency of the data warehouse.
  • FIG. 1 is a flowchart of a data processing method of a data warehouse according to Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of a dependency query result according to a data processing method according to Embodiment 2 of the present application;
  • FIG. 3 is a structural block diagram of a data processing apparatus of a data warehouse according to Embodiment 3 of the present application.
  • FIG. 1 is a flowchart of a data processing method of a data warehouse according to Embodiment 1 of the present application.
  • The data processing method of the data warehouse according to Embodiment 1 of the present application includes the following steps:
  • S102: receive a query condition input by a user, where the query condition includes a keyword of the data to be queried;
  • S104: determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, where the dependency is one of the following: no dependency, strong dependency, weak dependency;
  • S106: return the dependency to the user;
  • S108: receive a data processing instruction issued by the user according to the dependency;
  • S110: trigger the data warehouse to execute the data processing instruction on the data to be queried.
  • Tables are the most important component of a data warehouse.
  • A table usually consists of a key, measures, and attribute data.
  • For example, an employee table consists of the employee number (the key) and employee attribute data such as employee name and age.
  • A view, like a table, contains a series of named columns and rows of data. A view, however, does not exist in the database as a stored set of data values; it is defined by a query and can be treated as a virtual table.
  • A dependency is the relationship formed during data warehouse development when a table or view is used or consumed by other downstream views or tasks, or when a table or view, while being built, uses or consumes other upstream tables or views.
  • No dependency means that the data has no dependency of any kind on other data. Strong dependency means that a scheduling relationship exists between the data and other data; it is the strongest and most intuitive kind of dependency. Weak dependency means that no scheduling relationship exists between the data, but a dependency can be resolved by parsing, for example, SQL (Structured Query Language) execution logs or view DDL (Data Definition Language) statements. Weak dependencies are relatively hidden during data development and are easily overlooked. For example, a table being used by a view, or a table or view being used by a data factory, a timed task, or a data reflow production task, is a weak dependency.
  • Each table or view is depended on by downstream tasks and is also used by data users in tools such as an IDE (Integrated Development Environment), reporting tools, and timed tasks. A data warehouse currently holds tens of thousands of tables, with intricate dependencies among them.
  • In a specific implementation, the query condition input by the user includes a keyword of the data to be queried. The keyword may be the name of a table or a node ID (short for IDentity, an identification number); for example, when the data to be queried is an employee table, the keyword may be the employee number that serves as the key of that table.
  • The data processing method in the embodiments of the present application may be implemented with a traditional database such as Oracle, MySQL, or Teradata, or with a distributed database such as Greenplum, Hadoop, or ODPS.
  • The dependency between the data to be queried and other data in the data warehouse may be generated in advance or generated after the user's query request is received; the present application places no limit on this.
  • With the data processing method of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed.
  • Preferably, determining the dependency between the data to be queried and other data in the data warehouse according to the keyword includes: determining the data to be queried according to the keyword; and calling metadata to generate the dependency between the data to be queried and other data in the data warehouse.
  • Metadata is data that describes data: descriptive information about data and information resources, including business table structure information, data warehouse table structure information, and so on.
  • Preferably, the metadata includes one or more of scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, and timed task metadata.
  • Preferably, after returning the dependency to the user and before receiving the data processing instruction issued by the user according to the dependency, the method further includes: providing the user, according to the dependency, with data processing instructions for the data to be queried.
  • To make it easier for the user to process the queried data, corresponding processing instructions may be offered once the dependency of the data to be queried has been found: if the dependency is "no dependency", the user is offered the data processing instructions that correspond to data without dependencies; if the dependency is "strong dependency", the user is offered the instructions that correspond to strongly dependent data; if the dependency is "weak dependency", the user is offered the instructions that correspond to weakly dependent data.
  • Preferably, the data processing instruction is taking offline or changing.
  • Taking offline means physically deleting the table or renaming it as a backup.
  • Changing means updating the content of the table or the logic of the view.
  • With the solution of the embodiments, a data engineer can query the dependency of the data to be taken offline or changed and then choose between the two according to that dependency: for example, take the data offline if there is no dependency, change it and send a notification if the dependency is strong, and change it if the dependency is weak. The data engineer can thus process the data in the data warehouse according to its dependencies, which makes data processing more convenient, makes impact assessment more accurate, and improves the efficiency and accuracy of data processing.
  • In a specific implementation, the query condition may further include the direction and number of levels of the dependency query, for example tracing back N levels upstream or querying N levels downstream.
  • Tracing back upstream means querying, in the upstream direction, the N levels of tables or views that the data to be queried depends on.
  • Querying downstream means querying, in the downstream direction, the N levels of tables or views that depend on the data to be queried.
  • The user can use the dependency between the data to be queried and upstream data for error checking, model health checks, data path length detection, data processing efficiency evaluation, and so on.
  • The user can use the dependency between the data to be queried and downstream data for taking the data offline, changing it, and so on.
  • The data processing method in the embodiments can present its functions on the basis of the dependency results integrated from metadata, and supports querying and presenting N levels of dependencies upstream or downstream; a specific dependency result is shown in FIG. 2.
  • In FIG. 2, the lineage type to query is the category of dependency the user wants to query, including table lineage, view lineage, task lineage, and so on.
  • In a specific implementation, the user selects "table lineage" as the lineage type; the data to be queried is the table named "dwb_fnd_dback_all_dd"; the query level is 1 and the query direction is downstream.
  • After processing by the data processing method of the embodiment, the following nodes that have a dependency with the "dwb_fnd_dback_all_dd" table are returned to the user: "dwd1", "dws1", "dws2", "dwb1", "dws3", "st1", "dws4", "st2", "adm1", together with the node name, table name, dependency, and table type of each node.
  • The user can right-click a node to choose the corresponding processing mode.
  • All the results of the query in this embodiment are "strong dependency", so the "change" and "change notification" functions are offered to the user.
  • With the solution of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. Processing the data in the data warehouse according to its dependencies avoids the waste of resources found in the prior art, improves the resource utilization efficiency of the data warehouse, reduces the probability of data processing errors, and improves the efficiency and accuracy of data processing.
  • Based on the same inventive concept, the embodiments of the present application also provide a data processing apparatus for a data warehouse. Because the apparatus solves the problem on a principle similar to that of the data processing method, its implementation can refer to the implementation of the method, and repeated details are not described again.
  • FIG. 3 is a structural block diagram of a data processing apparatus of a data warehouse according to Embodiment 3 of the present application.
  • As shown in FIG. 3, the data processing apparatus 20 of the data warehouse includes: a query module 202, configured to receive a query condition input by a user, where the query condition includes a keyword of the data to be queried; a dependency determination module 204, configured to determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, where the dependency is one of the following: no dependency, strong dependency, weak dependency; a feedback module 206, configured to return the dependency to the user; an instruction receiving module 208, configured to receive a data processing instruction issued by the user according to the dependency; and a triggering module 210, configured to trigger the data warehouse to execute the data processing instruction on the data to be queried.
  • the dependency determining module specifically includes: a determining submodule for determining data to be queried according to the keyword; and a dependency generating submodule for generating a dependency of the data to be queried according to the metadata.
  • the metadata includes one or more of scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, and timing task metadata.
  • the data processing apparatus further comprises: an instruction providing module, configured to provide the user with data processing instructions for the data to be queried according to the dependency relationship.
  • Preferably, the data processing instruction is taking offline or changing.
  • The data processing apparatus in the embodiments of the present application may be implemented in a language such as Java, JSP, or .NET.
  • The downstream production task dependencies and the data consumption of a data warehouse's tables and views are intricate. Establishing full-coverage data impact analysis is essential for data production management: it reduces work complexity, improves development efficiency, and safeguards work quality.
  • With the data processing apparatus of the embodiments, a data development engineer can see intuitively the dependency between the table or view to be processed and other data, and can therefore judge intuitively the scope of impact of the data processing instruction to be executed and whether the data can be taken offline or changed.
  • In a specific implementation, the data processing apparatus may, through the query module, provide the user with a dependency query service, an offline and change notification query service, and so on.
  • In a specific implementation, the data processing apparatus may, through the dependency generation submodule, integrate the scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, timed task metadata, and so on, in order to analyze the dependencies between data accurately and comprehensively and to produce interface tables.
  • In a specific implementation, the data processing apparatus may present its functions on the basis of the dependency results integrated from metadata, and supports querying and presenting N levels of impact upstream or downstream.
  • In a specific implementation, the data processing apparatus may provide a one-click offline function for tables or views that nothing downstream depends on or uses, and may also provide functions such as taking offline tasks that nothing downstream depends on and physically deleting tables or renaming them as backups.
  • In a specific implementation, the data processing apparatus may further provide a change notification function for changed tables or views, so that a data development engineer can, based on the dependencies, send a change notification email to the owners of the downstream tasks of the changed table or view, or to its users.
  • With the solution of the embodiments, the user enters a table or name, sets the number of levels, and chooses to query dependencies upstream or downstream. The data processing apparatus calls the metadata service to query the dependency result and displays it, and the user can decide on that basis whether to take the data offline or to send a change notification. If downstream or usage information exists, the offline operation cannot be performed. If the offline operation is chosen, the apparatus triggers the data warehouse to physically delete or rename the table or view and takes the corresponding tasks offline. If a change is chosen, the user fills in a change description, after which the change is triggered and a change notification is sent: the system automatically emails the owners of downstream tasks and the data engineers who use the data, and the email includes the change description, the change impact list, and so on.
  • With the solution of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed, improves the resource utilization efficiency of the data warehouse, reduces the probability of data processing errors, and improves the accuracy of data processing.
  • embodiments of the present application can be provided as a method, system, or computer program product.
  • the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment in combination of software and hardware.
  • Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus. The instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

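A data processing method and apparatus for a data warehouse. The method includes: receiving a query condition input by a user, the query condition including a keyword of the data to be queried (S102); determining, according to the keyword, the dependency between the data to be queried and other data in the data warehouse (S104), the dependency being one of the following: no dependency, strong dependency, weak dependency; returning the dependency to the user (S106); receiving a data processing instruction issued by the user according to the dependency (S108); and triggering the data warehouse to execute the data processing instruction on the data to be queried (S110). The method can improve the resource utilization efficiency of the data warehouse.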

Description

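Data processing method and apparatus for a data warehouse
This application claims priority to Chinese patent application No. 201510303311.X, filed on June 4, 2015 and entitled "Data processing method and apparatus for a data warehouse", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to data processing technologies, and in particular to a data processing method and apparatus for a data warehouse.
Background
With the rise and rapid development of the Internet in the late 1990s, large volumes of information and data arrived. Organizing data scientifically, so that every aspect of a business's operations can be analyzed precisely and judged accurately from different perspectives, became more urgent than ever, and the effectiveness of the resulting actions drew more attention than before. Information systems built with these technologies are called data warehouses.
A data warehouse is an environment that provides the current and historical data users rely on for decision support; such data is difficult or impossible to obtain from traditional operational databases. Data warehousing technology is a general term for the technologies and modules that integrate operational data into a unified environment in order to provide access to decision-making data. All of this is done so that data users can query the information they need faster and more conveniently, and to give them decision support.
In the prior art, to keep the metrics produced by downstream data engineers from going wrong and to keep defects out of the data business logic, the usual approach is to leave the data in the data warehouse unprocessed.
With the data processing method of the prior art, data that is no longer useful keeps occupying computing and storage resources, which wastes resources.
Summary
The embodiments of the present application provide a data processing method and apparatus for a data warehouse, intended to solve the waste of resources caused in the prior art by useless data occupying resources.
According to one aspect of the embodiments of the present application, a data processing method for a data warehouse is provided, including: receiving a query condition input by a user, the query condition including a keyword of the data to be queried; determining, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency; returning the dependency to the user; receiving a data processing instruction issued by the user according to the dependency; and triggering the data warehouse to execute the data processing instruction on the data to be queried.
According to another aspect of the embodiments of the present application, a data processing apparatus for a data warehouse is provided, including: a query module, configured to receive a query condition input by a user, the query condition including a keyword of the data to be queried; a dependency determination module, configured to determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency; a feedback module, configured to return the dependency to the user; an instruction receiving module, configured to receive a data processing instruction issued by the user according to the dependency; and a triggering module, configured to trigger the data warehouse to execute the data processing instruction on the data to be queried.
With the data processing method and apparatus of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed and improves the resource utilization efficiency of the data warehouse.
Brief Description of the Drawings
The drawings described here provide a further understanding of the present application and form a part of it. The exemplary embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
FIG. 1 is a flowchart of the data processing method of a data warehouse according to Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of a dependency query result of the data processing method according to Embodiment 2 of the present application;
FIG. 3 is a structural block diagram of the data processing apparatus of a data warehouse according to Embodiment 3 of the present application.
Detailed Description
To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments are described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not an exhaustive list. Note that, where no conflict arises, the embodiments of the present application and the features in them may be combined with one another.
The solutions in the embodiments of the present application can be applied to the dependency (lineage) impact analysis function of tools such as a data dictionary. Those skilled in the art should understand, however, that this application scenario is shown only to help them understand the present application and is not intended to limit it.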
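FIG. 1 is a flowchart of the data processing method of a data warehouse according to Embodiment 1 of the present application.
As shown in FIG. 1, the data processing method of the data warehouse according to Embodiment 1 of the present application includes the following steps:
S102: receive a query condition input by a user, where the query condition includes a keyword of the data to be queried;
S104: determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, where the dependency is one of the following: no dependency, strong dependency, weak dependency;
S106: return the dependency to the user;
S108: receive a data processing instruction issued by the user according to the dependency;
S110: trigger the data warehouse to execute the data processing instruction on the data to be queried.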
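The five steps S102 to S110 can be sketched as a small pipeline. This is only a minimal illustration under stated assumptions, not the implementation described in the application: the function `process_request` and its three callback parameters are hypothetical names, and the callbacks stand in for the metadata lookup, the user interface, and the data warehouse.

```python
from enum import Enum
from typing import Callable


class Dependency(Enum):
    """The three dependency kinds named in step S104."""
    NONE = "no dependency"
    STRONG = "strong dependency"
    WEAK = "weak dependency"


def process_request(
    keyword: str,
    resolve_dependency: Callable[[str], Dependency],
    ask_user: Callable[[str, Dependency], str],
    execute: Callable[[str, str], None],
) -> str:
    """Run S102-S110 for one query; all callbacks are hypothetical stand-ins."""
    # S102: the query condition arrives as a keyword (table name or node ID).
    # S104: determine the dependency between the queried data and other data.
    dependency = resolve_dependency(keyword)
    # S106 + S108: show the dependency to the user and receive an instruction.
    instruction = ask_user(keyword, dependency)
    # S110: trigger the warehouse to execute the instruction on the queried data.
    execute(keyword, instruction)
    return instruction
```

Passing the three collaborators in as callables keeps the sketch independent of any particular database or front end.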
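Those skilled in the art should understand that the data stored in a data warehouse consists mainly of physical tables or views produced by data development. Tables are the most important component of a data warehouse. A table usually consists of a key, measures, and attribute data; for example, an employee table consists of the employee number (the key) and employee attribute data such as employee name and age. A view, like a table, contains a series of named columns and rows of data. A view, however, does not exist in the database as a stored set of data values; it is defined by a query and can be regarded as a virtual table.
A dependency is the relationship formed during data warehouse development when a table or view is used or consumed by other downstream views or tasks, or when a table or view, while being built, uses or consumes other upstream tables or views.
No dependency means that the data has no dependency of any kind on other data. Strong dependency means that a scheduling relationship exists between the data and other data; it is the strongest and most intuitive kind of dependency. Weak dependency means that no scheduling relationship exists between the data, but a dependency can be resolved by parsing, for example, SQL (Structured Query Language) execution logs or view DDL (Data Definition Language) statements. Weak dependencies are relatively hidden during data development and are easily overlooked. For example, a table being used by a view, or a table or view being used by a data factory, a timed task, or a data reflow production task, is a weak dependency.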
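The distinction between the three kinds can be illustrated with a short sketch. It assumes that scheduling relationships are already available as a set of table names and that weak dependencies are found by scanning SQL text; the regular expression is a deliberately naive stand-in for a real SQL parser, and the names `tables_in_sql` and `classify` are hypothetical.

```python
import re


def tables_in_sql(sql: str) -> set:
    """Naively collect the names that follow FROM or JOIN (illustrative only)."""
    return set(re.findall(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", sql, flags=re.I))


def classify(table: str, scheduled_upstreams: set, statements: list) -> str:
    """Classify how other objects depend on `table`."""
    if table in scheduled_upstreams:
        return "strong"  # a scheduling relationship exists
    if any(table in tables_in_sql(s) for s in statements):
        return "weak"    # only visible by parsing SQL logs or view DDL
    return "none"
```

For instance, a table that appears only in the FROM clause of a view definition is classified as weak, which matches the "table used by a view" example in the text.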
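Each table or view is depended on by downstream tasks and is also used by data users in tools such as an IDE (Integrated Development Environment), reporting tools, and timed tasks. A data warehouse currently holds tens of thousands of tables, with intricate dependencies among them.
In a specific implementation, the query condition input by the user includes a keyword of the data to be queried. The keyword may be the name of a table or a node ID (short for IDentity, an identification number); for example, when the data to be queried is an employee table, the keyword may be the employee number that serves as the key of that table.
In a specific implementation, the data processing method in the embodiments of the present application may be implemented with a traditional database such as Oracle, MySQL, or Teradata, or with a distributed database such as Greenplum, Hadoop, or ODPS.
In a specific implementation, the dependency between the data to be queried and other data in the data warehouse may be generated in advance or generated after the user's query request is received; the present application places no limit on this.
With the data processing method of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed.
Preferably, determining the dependency between the data to be queried and other data in the data warehouse according to the keyword specifically includes: determining the data to be queried according to the keyword; and calling metadata to generate the dependency between the data to be queried and other data in the data warehouse.
Metadata is data that describes data: descriptive information about data and information resources, including business table structure information, data warehouse table structure information, and so on.
Preferably, the metadata includes one or more of scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, and timed task metadata.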
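One way to picture how several metadata sources could be combined is the sketch below. It assumes each source has already been reduced to a list of (upstream, downstream) pairs, that only scheduling metadata yields strong dependencies, and that every other source yields weak ones; the source names used as dictionary keys and the function `merge_metadata` are hypothetical.

```python
def merge_metadata(sources: dict) -> dict:
    """Merge per-source edge lists into {(upstream, downstream): kind}.

    `sources` maps a metadata source name to a list of
    (upstream, downstream) pairs.
    """
    merged = {}
    for source, edges in sources.items():
        kind = "strong" if source == "scheduling" else "weak"
        for edge in edges:
            # A strong dependency is never downgraded by a weaker source.
            if merged.get(edge) != "strong":
                merged[edge] = kind
    return merged
```

Letting the strong kind win regardless of the order in which sources are read means the result does not depend on which metadata source happens to be processed first.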
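Preferably, after returning the dependency to the user and before receiving the data processing instruction issued by the user according to the dependency, the method further includes: providing the user, according to the dependency, with data processing instructions for the data to be queried.
To make it easier for the user to process the queried data, corresponding processing instructions may be offered once the dependency of the data to be queried has been found: if the dependency is "no dependency", the user is offered the data processing instructions that correspond to data without dependencies; if the dependency is "strong dependency", the user is offered the instructions that correspond to strongly dependent data; if the dependency is "weak dependency", the user is offered the instructions that correspond to weakly dependent data.
Preferably, the data processing instruction is taking offline or changing.
Those skilled in the art should understand that taking offline means physically deleting the table or renaming it as a backup, and that changing means updating the content of the table or the logic of the view.
In a specific implementation, the "offline" and "change" instructions are offered for data without dependencies; the "change" function and the "change notification" function are offered for data with strong dependencies; and "change" and the like are offered for data with weak dependencies. Those skilled in the art should understand that this mapping between dependencies and processing instructions is shown only as an example and is not intended to limit the present application.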
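The example mapping described above fits in a lookup table. The sketch below simply restates that mapping; the string labels and the function name `offered_instructions` are hypothetical, and, as the text notes, the mapping itself is only an example (the list for weak dependencies is given there as "change" and the like).

```python
# Example mapping from dependency kind to the instructions offered to the user.
OFFERED = {
    "none": ["offline", "change"],
    "strong": ["change", "change notification"],
    "weak": ["change"],
}


def offered_instructions(dependency: str) -> list:
    """Return a copy of the instruction list for the given dependency kind."""
    return list(OFFERED[dependency])
```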
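In the prior art, because of the intricate dependency and usage relationships among the tables and views in a data warehouse, a data engineer who wants to take data offline or change it can only look up its dependencies on other data by hand and then act on them. A manual search cannot cover the whole data warehouse, so the scope of impact of a change is uncertain. This can cause engineers who use the data to produce wrong metrics or introduce defects into the data business logic, leading to financial loss or customer complaints. The manual maintenance workload is also heavy, and an exhaustive manual search is very costly.
With the solution of the embodiments, by contrast, a data engineer can query the dependency of the data to be taken offline or changed and then choose between the two according to that dependency: for example, take the data offline if there is no dependency, change it and send a notification if the dependency is strong, and change it if the dependency is weak. The data engineer can thus process the data in the data warehouse according to its dependencies, which makes data processing more convenient, makes impact assessment more accurate, and improves the efficiency and accuracy of data processing.
In a specific implementation, the query condition may further include the direction and number of levels of the dependency query, for example tracing back N levels upstream or querying N levels downstream.
Tracing back upstream means querying, in the upstream direction, the N levels of tables or views that the data to be queried depends on; querying downstream means querying, in the downstream direction, the N levels of tables or views that depend on the data to be queried.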
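An N-level query in either direction amounts to a depth-limited breadth-first search over the dependency graph. The sketch below assumes the dependencies are available as (upstream, downstream) pairs; the function name `query_levels` and the sample table names in the usage note are hypothetical.

```python
from collections import defaultdict, deque


def query_levels(edges, start, direction, levels):
    """Return {node: level} for nodes within `levels` hops of `start`.

    `edges` is an iterable of (upstream, downstream) pairs and
    `direction` is "upstream" or "downstream".
    """
    neighbours = defaultdict(set)
    for up, down in edges:
        if direction == "downstream":
            neighbours[up].add(down)
        else:
            neighbours[down].add(up)

    found = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == levels:
            continue  # do not expand beyond the requested level
        for nxt in sorted(neighbours[node]):
            if nxt not in seen:
                seen.add(nxt)
                found[nxt] = depth + 1
                queue.append((nxt, depth + 1))
    return found
```

With the chain `ods -> dwd -> dws -> adm`, a one-level downstream query from `dwd` finds only `dws`, while a two-level query also reaches `adm`.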
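The user can use the dependency between the data to be queried and upstream data for error checking, model health checks, data path length detection, data processing efficiency evaluation, and so on.
The user can use the dependency between the data to be queried and downstream data for taking the data offline, changing it, and so on.
The data processing method according to Embodiment 2 of the present application is described below with reference to FIG. 2.
The data processing method in the embodiments can present its functions on the basis of the dependency results integrated from metadata, and supports querying and presenting N levels of dependencies upstream or downstream; a specific dependency result is shown in FIG. 2.
In FIG. 2, the lineage type to query is the category of dependency the user wants to query, including table lineage, view lineage, task lineage, and so on.
In a specific implementation, the user selects "table lineage" as the lineage type; the data to be queried is the table named "dwb_fnd_dback_all_dd"; the query level is 1 and the query direction is downstream.
After processing by the data processing method of the embodiment, the following nodes that have a dependency with the "dwb_fnd_dback_all_dd" table are returned to the user: "dwd1", "dws1", "dws2", "dwb1", "dws3", "st1", "dws4", "st2", "adm1", together with the node name, table name, dependency, and table type of each node.
The user can right-click a node to choose the corresponding processing mode. All the results of the query in this embodiment are "strong dependency", so the "change" and "change notification" functions are offered to the user.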
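The query condition used in this example (lineage type, keyword, level, direction) can be modelled as a small validated record. This is a sketch only: the class `QueryCondition`, its field names, and the restriction to three lineage types are assumptions (the text lists table, view, and task lineage "and so on").

```python
from dataclasses import dataclass

LINEAGE_TYPES = {"table", "view", "task"}  # assumed closed set for the sketch
DIRECTIONS = {"upstream", "downstream"}


@dataclass(frozen=True)
class QueryCondition:
    lineage_type: str  # category of dependency to query
    keyword: str       # table name or node ID
    level: int         # number of levels N to traverse
    direction: str     # "upstream" or "downstream"

    def __post_init__(self):
        if self.lineage_type not in LINEAGE_TYPES:
            raise ValueError(f"unknown lineage type: {self.lineage_type}")
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction}")
        if self.level < 1:
            raise ValueError("level must be at least 1")


# The query described for FIG. 2.
FIG2_QUERY = QueryCondition("table", "dwb_fnd_dback_all_dd", 1, "downstream")
```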
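With the solution of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. Processing the data in the data warehouse according to its dependencies avoids the waste of resources found in the prior art, improves the resource utilization efficiency of the data warehouse, reduces the probability of data processing errors, and improves the efficiency and accuracy of data processing.
Based on the same inventive concept, the embodiments of the present application also provide a data processing apparatus for a data warehouse. Because the apparatus solves the problem on a principle similar to that of the data processing method, its implementation can refer to the implementation of the method, and repeated details are not described again.
FIG. 3 is a structural block diagram of the data processing apparatus of a data warehouse according to Embodiment 3 of the present application.
As shown in FIG. 3, the data processing apparatus 20 of the data warehouse according to Embodiment 2 of the present application includes: a query module 202, configured to receive a query condition input by a user, where the query condition includes a keyword of the data to be queried; a dependency determination module 204, configured to determine, according to the keyword, the dependency between the data to be queried and other data in the data warehouse, where the dependency is one of the following: no dependency, strong dependency, weak dependency; a feedback module 206, configured to return the dependency to the user; an instruction receiving module 208, configured to receive a data processing instruction issued by the user according to the dependency; and a triggering module 210, configured to trigger the data warehouse to execute the data processing instruction on the data to be queried.
Preferably, the dependency determination module specifically includes: a determination submodule, configured to determine the data to be queried according to the keyword; and a dependency generation submodule, configured to generate the dependency of the data to be queried according to metadata.
Preferably, the metadata includes one or more of scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, and timed task metadata.
Preferably, the data processing apparatus further includes: an instruction providing module, configured to provide the user, according to the dependency, with data processing instructions for the data to be queried.
Preferably, the data processing instruction is taking offline or changing.
In a specific implementation, the data processing apparatus in the embodiments of the present application may be implemented in a language such as Java, JSP, or .NET.
The downstream production task dependencies and the data consumption of a data warehouse's tables and views are intricate. Establishing full-coverage data impact analysis is essential for data production management: it reduces work complexity, improves development efficiency, and safeguards work quality. With the data processing apparatus of the embodiments, a data development engineer can see intuitively the dependency between the table or view to be processed and other data, and can therefore judge intuitively the scope of impact of the data processing instruction to be executed and whether the data can be taken offline or changed.
In a specific implementation, the data processing apparatus may, through the query module, provide the user with a dependency query service, an offline and change notification query service, and so on.
In a specific implementation, the data processing apparatus may, through the dependency generation submodule, integrate the scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, timed task metadata, and so on, in order to analyze the dependencies between data accurately and comprehensively and to produce interface tables.
In a specific implementation, the data processing apparatus may present its functions on the basis of the dependency results integrated from metadata, and supports querying and presenting N levels of impact upstream or downstream.
In a specific implementation, the data processing apparatus may provide a one-click offline function for tables or views that nothing downstream depends on or uses, and may also provide functions such as taking offline tasks that nothing downstream depends on and physically deleting tables or renaming them as backups.
In a specific implementation, the data processing apparatus may further provide a change notification function for changed tables or views, so that a data development engineer can, based on the dependencies, send a change notification email to the owners of the downstream tasks of the changed table or view, or to its users.
With the solution of the embodiments, the user enters a table or name, sets the number of levels, and chooses to query dependencies upstream or downstream. The data processing apparatus calls the metadata service to query the dependency result and displays it, and the user can decide on that basis whether to take the data offline or to send a change notification. If downstream or usage information exists, the offline operation cannot be performed. If the offline operation is chosen, the apparatus triggers the data warehouse to physically delete or rename the table or view and takes the corresponding tasks offline. If a change is chosen, the user fills in a change description, after which the change is triggered and a change notification is sent: the system automatically emails the owners of downstream tasks and the data engineers who use the data, and the email includes the change description, the change impact list, and so on.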
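The decision flow in the preceding paragraph can be sketched as one function that either refuses, plans an offline operation, or plans a change with its notification. Everything here is illustrative: `handle_instruction`, the shape of the returned dictionary, and the `owners` mapping from downstream object to owner address are assumptions, and no email is actually sent.

```python
def handle_instruction(table, instruction, downstream, owners, description=""):
    """Plan the action for `instruction` on `table`.

    `downstream` is the set of objects that depend on or use `table`;
    `owners` maps a downstream object to its owner's address.
    """
    if instruction == "offline":
        if downstream:
            # Downstream or usage information exists: offline is not allowed.
            raise ValueError("offline refused: downstream or usage information exists")
        return {"action": "offline", "table": table}
    if instruction == "change":
        if not description:
            raise ValueError("a change description is required")
        return {
            "action": "change",
            "table": table,
            "notify": sorted({owners[d] for d in downstream if d in owners}),
            "description": description,
            "impact_list": sorted(downstream),
        }
    raise ValueError(f"unknown instruction: {instruction}")
```

Returning a plan rather than performing the deletion or sending mail keeps the sketch free of side effects; a caller would pass the plan on to the warehouse and to a mail service.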
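With the solution of the embodiments, after a query condition input by the user is received, the dependency between the data to be queried and other data can be determined and returned to the user. The user can then issue a data processing instruction for the data to be queried according to the dependency, and the data warehouse is triggered to execute it. The data in the data warehouse can thus be processed according to its dependencies, which avoids the waste of resources caused in the prior art by leaving the data unprocessed, improves the resource utilization efficiency of the data warehouse, reduces the probability of data processing errors, and improves the accuracy of data processing.
For convenience of description, the parts of the apparatus above are described separately as components or units divided by function. When the present application is implemented, the functions of the components or units may of course be realized in the same piece, or in several pieces, of software or hardware.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to cover them as well.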

Claims (10)

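  1. A data processing method for a data warehouse, characterized by comprising:
    receiving a query condition input by a user, the query condition including a keyword of data to be queried;
    determining, according to the keyword, a dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency;
    returning the dependency to the user;
    receiving a data processing instruction issued by the user according to the dependency; and
    triggering the data warehouse to execute the data processing instruction on the data to be queried.
  2. The method according to claim 1, characterized in that determining, according to the keyword, the dependency between the data to be queried and other data in the data warehouse specifically comprises:
    determining the data to be queried according to the keyword; and
    calling metadata to generate the dependency between the data to be queried and other data in the data warehouse.
  3. The method according to claim 2, characterized in that the metadata includes one or more of scheduling metadata, Structured Query Language (SQL) execution log metadata, table structure metadata, synchronization center metadata, and timed task metadata.
  4. The method according to claim 1, characterized in that, after returning the dependency to the user and before receiving the data processing instruction issued by the user according to the dependency, the method further comprises:
    providing the user, according to the dependency, with data processing instructions for the data to be queried.
  5. The method according to claim 1, characterized in that the data processing instruction is taking offline or changing.
  6. A data processing apparatus for a data warehouse, characterized by comprising:
    a query module, configured to receive a query condition input by a user, the query condition including a keyword of data to be queried;
    a dependency determination module, configured to determine, according to the keyword, a dependency between the data to be queried and other data in the data warehouse, the dependency being one of the following: no dependency, strong dependency, weak dependency;
    a feedback module, configured to return the dependency to the user;
    an instruction receiving module, configured to receive a data processing instruction issued by the user according to the dependency; and
    a triggering module, configured to trigger the data warehouse to execute the data processing instruction on the data to be queried.
  7. The apparatus according to claim 6, characterized in that the dependency determination module specifically comprises:
    a determination submodule, configured to determine the data to be queried according to the keyword; and
    a dependency generation submodule, configured to generate the dependency of the data to be queried according to metadata.
  8. The apparatus according to claim 6, characterized in that the metadata includes one or more of scheduling metadata, SQL execution log metadata, table structure metadata, synchronization center metadata, and timed task metadata.
  9. The apparatus according to claim 6, characterized by further comprising:
    an instruction providing module, configured to provide the user, according to the dependency, with data processing instructions for the data to be queried.
  10. The apparatus according to claim 6, characterized in that the data processing instruction is taking offline or changing.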
PCT/CN2016/083591 2015-06-04 2016-05-27 Data processing method and apparatus for a data warehouse WO2016192583A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510303311.XA CN106294478B (zh) 2015-06-04 2015-06-04 Data processing method and apparatus for a data warehouse
CN201510303311.X 2015-06-04

Publications (1)

Publication Number Publication Date
WO2016192583A1 (zh) 2016-12-08

Family

ID=57440172

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/083591 WO2016192583A1 (zh) 2015-06-04 2016-05-27 Data processing method and apparatus for a data warehouse

Country Status (2)

Country Link
CN (1) CN106294478B (zh)
WO (1) WO2016192583A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
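CN110471949A (zh) * 2019-07-11 2019-11-19 阿里巴巴集团控股有限公司 Data lineage analysis method, apparatus, system, server, and storage medium
CN110727677A (zh) * 2019-09-19 2020-01-24 上海数禾信息科技有限公司 Method and apparatus for tracing the lineage of tables in a data warehouse
CN113138973A (zh) * 2021-04-20 2021-07-20 建信金融科技有限责任公司 Data management system and working method
CN113590610A (zh) * 2021-06-29 2021-11-02 四川新网银行股份有限公司 Lineage representation method based on Elastic Search
CN113868253A (zh) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Method for capturing data relationships and building a big-data relationship tree
CN115470304A (zh) * 2022-08-31 2022-12-13 北京九章云极科技有限公司 Feature causal warehouse management method and system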

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
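CN107391101B (zh) * 2017-04-21 2021-03-23 创新先进技术有限公司 Information processing method and apparatus
CN110019384B (zh) * 2017-08-15 2023-06-27 阿里巴巴集团控股有限公司 Method for acquiring lineage data, and method and apparatus for providing lineage data
CN108764674B (zh) * 2018-05-16 2021-02-09 普信恒业科技发展(北京)有限公司 Rule engine-based risk control method and apparatus
CN109308301A (zh) * 2018-09-28 2019-02-05 中国银行股份有限公司 Method and apparatus for obtaining test data
CN110297820B (zh) * 2019-06-28 2020-09-01 京东数字科技控股有限公司 Data processing method, apparatus, device, and storage medium
CN111639062B (zh) * 2020-05-29 2023-07-28 京东方科技集团股份有限公司 Method, system, and storage medium for one-click data warehouse construction
CN111930734B (zh) * 2020-08-11 2023-08-04 中国工商银行股份有限公司 Task- and field-based data offlining method and system
CN112433888B (zh) * 2020-12-02 2023-06-30 网易(杭州)网络有限公司 Data processing method and apparatus, storage medium, and electronic device
CN113486108A (zh) * 2021-07-06 2021-10-08 建信金融科技有限责任公司 Data processing method, apparatus, electronic device, and computer-readable medium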

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
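CN1588369A (zh) * 2004-09-06 2005-03-02 杭州恒生电子股份有限公司 Relational database system and query and reporting method thereof
CN101515290A (zh) * 2009-03-25 2009-08-26 中国工商银行股份有限公司 Metadata management system with bidirectional interaction features and implementation method thereof
CN101685452A (zh) * 2008-09-26 2010-03-31 阿里巴巴集团控股有限公司 Data warehouse scheduling method and scheduling system
CN103778133A (zh) * 2012-10-18 2014-05-07 阿里巴巴集团控股有限公司 Method and apparatus for changing database objects
CN104199978A (zh) * 2014-09-24 2014-12-10 普元信息技术股份有限公司 System and method for metadata caching and analysis based on NoSQL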

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8200613B1 (en) * 2002-07-11 2012-06-12 Oracle International Corporation Approach for performing metadata reconciliation
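CN102339298A (zh) * 2010-07-28 2012-02-01 ***通信集团公司 Method, apparatus, and system for updating SQL script metadata
CN102880500B (zh) * 2011-07-13 2016-06-15 阿里巴巴集团控股有限公司 Task tree optimization method and apparatus
CN102508689A (zh) * 2011-11-08 2012-06-20 上海交通大学 Dependency-preserving data processing system for data flow graph extraction from high-level language programs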
US9665643B2 (en) * 2011-12-30 2017-05-30 Microsoft Technology Licensing, Llc Knowledge-based entity detection and disambiguation
GB2508573A (en) * 2012-02-28 2014-06-11 Qatar Foundation A computer-implemented method and computer program for detecting a set of inconsistent data records in a database including multiple records
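CN103677753A (zh) * 2012-09-20 2014-03-26 艾默生零售解决方案公司 Multitask control method, device, and industrial control system
CN103870571B (zh) * 2014-03-14 2017-06-06 华为技术有限公司 Cube reconstruction method and apparatus in a multidimensional online analytical processing system
CN104036034A (zh) * 2014-06-30 2014-09-10 百度在线网络技术(北京)有限公司 Log analysis method and apparatus for a data warehouse
CN104268216A (zh) * 2014-09-24 2015-01-07 江苏名通信息科技有限公司 Internet information-based data cleaning system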

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
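CN1588369A (zh) * 2004-09-06 2005-03-02 杭州恒生电子股份有限公司 Relational database system and query and reporting method thereof
CN101685452A (zh) * 2008-09-26 2010-03-31 阿里巴巴集团控股有限公司 Data warehouse scheduling method and scheduling system
CN101515290A (zh) * 2009-03-25 2009-08-26 中国工商银行股份有限公司 Metadata management system with bidirectional interaction features and implementation method thereof
CN103778133A (zh) * 2012-10-18 2014-05-07 阿里巴巴集团控股有限公司 Method and apparatus for changing database objects
CN104199978A (zh) * 2014-09-24 2014-12-10 普元信息技术股份有限公司 System and method for metadata caching and analysis based on NoSQL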

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
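CN110471949A (zh) * 2019-07-11 2019-11-19 阿里巴巴集团控股有限公司 Data lineage analysis method, apparatus, system, server, and storage medium
CN110727677A (zh) * 2019-09-19 2020-01-24 上海数禾信息科技有限公司 Method and apparatus for tracing the lineage of tables in a data warehouse
CN110727677B (zh) * 2019-09-19 2022-12-30 上海数禾信息科技有限公司 Method and apparatus for tracing the lineage of tables in a data warehouse
CN113138973A (zh) * 2021-04-20 2021-07-20 建信金融科技有限责任公司 Data management system and working method
CN113590610A (zh) * 2021-06-29 2021-11-02 四川新网银行股份有限公司 Elastic Search-based lineage representation method
CN113590610B (zh) * 2021-06-29 2023-06-20 四川新网银行股份有限公司 Elastic Search-based lineage representation method
CN113868253A (zh) * 2021-09-28 2021-12-31 中通服创立信息科技有限责任公司 Data relationship capture and big data relationship tree construction method
CN113868253B (zh) * 2021-09-28 2024-04-23 中通服创立信息科技有限责任公司 Data relationship capture and big data relationship tree construction method
CN115470304A (zh) * 2022-08-31 2022-12-13 北京九章云极科技有限公司 Feature causal warehouse management method and system
CN115470304B (zh) * 2022-08-31 2023-08-25 北京九章云极科技有限公司 Feature causal warehouse management method and system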

Also Published As

Publication number Publication date
CN106294478B (zh) 2019-11-08
CN106294478A (zh) 2017-01-04

Similar Documents

Publication Publication Date Title
WO2016192583A1 (zh) Data processing method and apparatus for a data warehouse
EP3475885B1 (en) System and method for dynamic, incremental recommendations within real-time visual simulation
US10534775B2 (en) Cardinality estimation for database query planning
US9996592B2 (en) Query relationship management
WO2019143705A1 (en) Dimension context propagation techniques for optimizing sql query plans
US9244971B1 (en) Data retrieval from heterogeneous storage systems
US8719271B2 (en) Accelerating data profiling process
CN110674358B (zh) Enterprise information comparison and analysis method, apparatus, computer device, and storage medium
EP3991044A1 (en) Diagnosing & triaging performance issues in large-scale services
Alexandrov et al. Issues in big data testing and benchmarking
US9037525B2 (en) Correlating data from multiple business processes to a business process scenario
US10489266B2 (en) Generating a visualization of a metric at one or multiple levels of execution of a database workload
US20140250121A1 (en) Translating business scenario definitions into corresponding database artifacts
JP2013531844A (ja) Data mart automation
EP3413214A1 (en) Selectivity estimation for database query planning
CN109753596B (zh) Information source management and configuration method and system for large-scale network data acquisition
US20160292233A1 (en) Discarding data points in a time series
US20220179873A1 (en) Data management platform, intelligent defect analysis system, intelligent defect analysis method, computer-program product, and method for defect analysis
WO2019228015A1 (zh) Index creation method and apparatus based on a mobile-terminal NoSQL database
US11132363B2 (en) Distributed computing framework and distributed computing method
US20140006000A1 (en) Built-in response time analytics for business applications
US20140136274A1 (en) Providing multiple level process intelligence and the ability to transition between levels
US20170161359A1 (en) Pattern-driven data generator
JP2007172516A (ja) Method and program for predicting the time required to search a database using an SQL statement
Kassela et al. Towards a Multi-engine Query Optimizer for Complex SQL Queries on Big Data

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16802506; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16802506; Country of ref document: EP; Kind code of ref document: A1)