CN114546415A - Big data storage optimization analysis system for cloud platform - Google Patents

Big data storage optimization analysis system for cloud platform

Info

Publication number
CN114546415A
Authority
CN
China
Prior art keywords
data
layer
platform
business
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210162512.2A
Other languages
Chinese (zh)
Other versions
CN114546415B (en)
Inventor
袁建
周子岩
赵可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaneng Tendering Co ltd
Original Assignee
Huaneng Tendering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaneng Tendering Co ltd filed Critical Huaneng Tendering Co ltd
Priority to CN202210162512.2A priority Critical patent/CN114546415B/en
Publication of CN114546415A publication Critical patent/CN114546415A/en
Application granted granted Critical
Publication of CN114546415B publication Critical patent/CN114546415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451User profiles; Roaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cloud platform big data and discloses a big data storage optimization analysis system for a cloud platform, which comprises: building a data integration platform; establishing a data warehouse; deploying a BI data analysis platform; and building a system bottom framework. Unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB (Greenplum database) ensures data consistency by using multi-version concurrency control, which means that when the database is queried each transaction sees only a snapshot of the data; this guarantees that the current transaction does not see modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database.

Description

Big data storage optimization analysis system for cloud platform
Technical Field
The invention relates to the technical field of cloud platform big data, in particular to a big data storage optimization analysis system for a cloud platform.
Background
MVCC stands for MultiVersion Concurrency Control, a multi-version concurrency control technique. Its principle is to achieve concurrency control in a database by managing multiple versions of each data row, which, simply put, means storing historical versions of the data.
However, when concurrency control relies on a lock mechanism, reads and writes block each other, which reduces the concurrent data processing capability: the lock taken by a query (read) conflicts with the lock taken by a write, and the probability of deadlock increases.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a big data storage optimization analysis system for a cloud platform, so as to solve the problems set forth in the background art.
(II) technical scheme
In order to achieve the purpose, the invention provides the following technical scheme: a big data storage optimization analysis system for a cloud platform comprises the following steps:
s1, building a data integration platform for implementing deployment by adopting an image-based ETL integration platform, wherein the data integration platform comprises a Kettle product and an image-based IPAS product;
s2, establishing a data warehouse for using an open source GreenPlum cluster as a bottom database and combining with a schema data warehouse solution;
s3, deploying a BI data analysis platform, and implementing deployment by adopting an image AG product;
and S4, building a system bottom framework, and using the view SEA2 enterprise computing platform as the system bottom framework.
Preferably, according to what is proposed in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp (a minimal sketch of this timestamp-driven incremental extraction is given after this list).
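The following Python sketch illustrates the timestamp-driven incremental extraction described in the seventh item. It is a minimal, hypothetical example: the table and column names (etl_task_pool, src_orders, ods_orders, last_ts) and the connection strings are assumptions rather than part of the original disclosure, and psycopg2 is used only because GreenPlum speaks the PostgreSQL wire protocol.

import psycopg2

def extract_increment(task_name, src_dsn, dw_dsn):
    # Hypothetical sketch: pull only rows changed since the task's last watermark.
    dw = psycopg2.connect(dw_dsn)    # warehouse holding the task pool (watermarks)
    src = psycopg2.connect(src_dsn)  # source business system
    try:
        with dw.cursor() as cur:
            # 1. read the watermark timestamp recorded for this task
            cur.execute("SELECT last_ts FROM etl_task_pool WHERE task_name = %s",
                        (task_name,))
            last_ts = cur.fetchone()[0]
        with src.cursor() as cur:
            # 2. extract only the increment: rows changed after the watermark
            cur.execute("SELECT id, amount, updated_at FROM src_orders "
                        "WHERE updated_at > %s", (last_ts,))
            rows = cur.fetchall()
        new_ts = max(r[2] for r in rows) if rows else last_ts
        with dw.cursor() as cur:
            # 3. load into the ODS layer and advance the watermark in one transaction
            cur.executemany("INSERT INTO ods_orders (id, amount, updated_at) "
                            "VALUES (%s, %s, %s)", rows)
            cur.execute("UPDATE etl_task_pool SET last_ts = %s WHERE task_name = %s",
                        (new_ts, task_name))
        dw.commit()
    finally:
        src.close()
        dw.close()

Advancing the watermark to the maximum extracted updated_at, rather than the wall clock, keeps the next run from skipping rows committed while this one was executing.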
preferably, according to the proposal in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables (a minimal layered-loading sketch follows this list);
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses.
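As a rough illustration of the ODS, DW and DM layering referenced in the fifth type, the sketch below derives each layer only from the layer beneath it. The schema names mirror the layer names used here, but the tables and columns (ods.sales_orders, dw.dim_product, dw.fact_sales, dm.sales_by_product) are invented for the example and are not part of the original disclosure; the SQL is issued through psycopg2 since GreenPlum is PostgreSQL-compatible.

import psycopg2

# Hypothetical layered loading: ODS keeps the source structure, the DW layer splits
# facts from dimensions, and the DM layer pre-aggregates for fast BI queries.
LAYERED_SQL = [
    # DW layer: model ODS data into one dimension table and one fact table
    """INSERT INTO dw.dim_product (product_id, product_name, category)
       SELECT DISTINCT product_id, product_name, category FROM ods.sales_orders""",
    """INSERT INTO dw.fact_sales (order_id, product_id, quantity, amount, order_date)
       SELECT order_id, product_id, quantity, amount, order_date FROM ods.sales_orders""",
    # DM layer: compress granularity so dashboards hit a small aggregated table
    """INSERT INTO dm.sales_by_product (product_id, order_month, total_amount)
       SELECT product_id, date_trunc('month', order_date), sum(amount)
       FROM dw.fact_sales GROUP BY product_id, date_trunc('month', order_date)""",
]

def refresh_layers(dsn):
    with psycopg2.connect(dsn) as conn:      # one transaction for the whole refresh
        with conn.cursor() as cur:
            for stmt in LAYERED_SQL:
                cur.execute(stmt)            # each layer reads only the layer below
    conn.close()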
preferably, according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management (a minimal connection-pool sketch follows this list);
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management.
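A minimal sketch of the data-source management and connection pooling described in the first part above. The data source names, DSNs and pool sizes are illustrative assumptions; psycopg2's built-in pool is used for the PostgreSQL/GreenPlum case.

from psycopg2 import pool

# Hypothetical data-source registry: everything needed to establish a connection is
# stored under a data source name, and each source gets its own connection pool.
DATA_SOURCES = {
    "warehouse_gp": "dbname=dw user=bi password=secret host=gp-master port=5432",
    "ods_mirror": "dbname=ods user=bi password=secret host=gp-master port=5432",
}
POOLS = {name: pool.SimpleConnectionPool(1, 5, dsn) for name, dsn in DATA_SOURCES.items()}

def run_query(source_name, sql, params=None):
    # Borrow a pooled connection for the named data source and run one query.
    p = POOLS[source_name]
    conn = p.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        p.putconn(conn)    # return the connection to the pool instead of closing it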
Preferably, the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access (a hypothetical configuration sketch follows this list);
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
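Because the IPAS configuration model is proprietary and not detailed in this document, the following sketch is purely illustrative: it shows how a message input port, a command mediator with its commands, and a shared parameter set might be declared as plain data and dispatched. Every name and field in it is an assumption, not the product's actual interface.

# Hypothetical IPAS-style flow: a PORT receives a message, a mediator orchestrates
# its commands in order, and a shared parameter set supplies values common to
# several APIs. Illustrative only.
SHARED_PARAMS = {"sap_host": "sap.example.internal", "sap_client": "100"}

PORT_CONFIG = {
    "name": "order_inbound",
    "protocol": "HTTP",          # transmission protocol
    "format": "JSON",            # data format
    "dictionary": "order_v1",    # field dictionary used for validation
    "mediator": "order_flow",    # business process that handles this port
}

COMMAND_MEDIATOR = {
    "name": "order_flow",
    "commands": [                # executed and coordinated in this order
        {"type": "json_mapping", "mapping_file": "order_to_idoc.json"},
        {"type": "sap_idoc_import", "params_ref": "SHARED_PARAMS"},
        {"type": "mail_send", "to": "ops@example.com", "subject": "order imported"},
    ],
}

def run_flow(message, port=PORT_CONFIG, mediator=COMMAND_MEDIATOR):
    # Toy dispatcher: validate the port/mediator pairing, then "execute" each command.
    assert port["mediator"] == mediator["name"]
    for cmd in mediator["commands"]:
        params = SHARED_PARAMS if cmd.get("params_ref") == "SHARED_PARAMS" else {}
        print("executing", cmd["type"], "with", params or cmd)

run_flow({"order_id": 42, "amount": 99.0})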
Preferably, the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
(III) advantageous effects
Compared with the prior art, the invention provides a big data storage optimization analysis system for a cloud platform, which has the following beneficial effects:
the system adopts the open-source GreenPlum distributed database as the bottom-layer storage and computing engine; each processing unit has its own private CPU, memory and hard disk, there are no shared resources, and the processing units communicate with one another by protocol, giving better parallel processing and expansion capabilities; each node is independent and processes its own data, and the processed results can be summarized to the upper layer or exchanged among the nodes; the Share-Nothing architecture has obvious advantages in scalability and cost;
a massively parallel processing system consists of many loosely coupled processing units; relying on this high-performance MPP (Massively Parallel Processing) architecture, Greenplum can decompose TB-scale data warehouse loads and use all system resources to process a single query in parallel;
unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB ensures data consistency by using multi-version concurrency control (MultiVersion Concurrency Control, MVCC), which means that when the database is queried each transaction sees only a snapshot of the data; this ensures that the current transaction does not see the modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database; the greatest advantage of using MVCC instead of a lock mechanism for concurrency control is that with MVCC the lock taken by a query (read) does not conflict with the lock taken by a write, and reads and writes do not block each other.
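To make the snapshot behaviour concrete, here is a toy in-memory illustration of MVCC-style snapshot isolation. It is a conceptual sketch only, not GreenPlum's actual implementation: every write appends a new row version tagged with a commit id, and a reading transaction only sees versions committed before its snapshot was taken.

class MVCCStore:
    # Toy multi-version store: readers see a snapshot, writers append new versions.
    def __init__(self):
        self.versions = {}      # key -> list of (commit_id, value)
        self.next_id = 1

    def begin(self):
        # A transaction's snapshot is the commit horizon at the moment it starts.
        txid = self.next_id
        self.next_id += 1
        return {"snapshot": txid, "writes": {}}

    def read(self, tx, key):
        # Return the newest version committed before this transaction began.
        visible = [v for cid, v in self.versions.get(key, []) if cid < tx["snapshot"]]
        return visible[-1] if visible else None

    def write(self, tx, key, value):
        tx["writes"][key] = value           # buffered, invisible to other transactions

    def commit(self, tx):
        commit_id = self.next_id            # commit order defines visibility
        self.next_id += 1
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((commit_id, value))

store = MVCCStore()
setup = store.begin()
store.write(setup, "balance", 100)
store.commit(setup)

reader = store.begin()                      # snapshot taken here
writer = store.begin()
store.write(writer, "balance", 50)
store.commit(writer)                        # committed after the reader's snapshot

print(store.read(reader, "balance"))        # 100 -- reader keeps its snapshot
print(store.read(store.begin(), "balance")) # 50  -- a new transaction sees the update

Because the reader consults an older version instead of taking a lock, it is never blocked by the concurrent writer, which is exactly the read/write behaviour claimed above.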
Drawings
FIG. 1 is a flow chart of data extraction according to the present invention;
FIG. 2 is a flow chart of task scheduling of the present invention;
FIG. 3 is a schematic view of a data warehouse hierarchy in accordance with the present invention;
FIG. 4 is a flow chart of the functional architecture of the data analysis platform according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme and discloses a big data storage optimization analysis system for a cloud platform which, as shown in FIG. 1 to FIG. 4, comprises the following steps:
s1, building a data integration platform for implementing deployment by adopting an image-based ETL integration platform, wherein the data integration platform comprises a Kettle product and an image-based IPAS product;
s2, establishing a data warehouse for using an open source GreenPlum cluster as a bottom database and combining with a schema data warehouse solution;
s3, deploying a BI data analysis platform, and implementing deployment by adopting an image AG product;
and S4, building a system bottom framework, and using the view SEA2 enterprise computing platform as the system bottom framework.
Preferably, according to what is set forth in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp;
preferably, according to the proposal in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables;
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses;
preferably, according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management;
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management;
Preferably, the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access;
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
Preferably, the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
The working principle of the system is as follows: the system adopts the open-source GreenPlum distributed database as the bottom-layer storage and computing engine. GreenPlum is an open-source big data platform built on a distributed database architecture; it adopts a share-nothing MPP architecture, has good linear scalability, and offers efficient parallel computation and parallel storage. It has a unique and efficient ORCA optimizer, is compatible with SQL syntax, and is suitable for efficient storage, processing and real-time analysis of PB-scale data. Because its kernel is based on the PostgreSQL database, it also supports mixed workloads covering OLTP-type business, and its backup nodes provide high availability for the database. Compared with Hadoop, it is better suited as the storage, computation and analysis engine for structured big data. When used with this system it has the following characteristics:
Shared-Nothing: each processing unit has its own private CPU, memory and hard disk, there are no shared resources, and the processing units communicate with one another by protocol, giving better parallel processing and expansion capabilities; each node is independent and processes its own data, and the processed results can be summarized to the upper layer or exchanged among the nodes; the Share-Nothing architecture has obvious advantages in scalability and cost;
MPP: a massively parallel processing system consists of many loosely coupled processing units; relying on this high-performance MPP (Massively Parallel Processing) architecture, GreenPlum can decompose TB-scale data warehouse loads and use all system resources to process a single query in parallel (a minimal scatter-gather sketch is given after these characteristics);
MVCC: unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB guarantees data consistency by using multi-version concurrency control (MultiVersion Concurrency Control, MVCC), which means that when the database is queried each transaction sees only a snapshot of the data; this ensures that the current transaction does not see the modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database; the greatest advantage of using MVCC instead of a lock mechanism for concurrency control is that the lock taken by a query (read) does not conflict with the lock taken by a write, and reads and writes do not block each other.
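As a rough illustration of the MPP behaviour described above, the sketch below scatters an aggregation over several worker processes, each owning its own share-nothing partition of the data, and gathers the partial results at the master. The partitions and products are invented for the example and are not tied to GreenPlum's actual executor.

from multiprocessing import Pool

PARTITIONS = [                       # each segment's private slice of the fact data
    [("prodA", 10.0), ("prodB", 5.0)],
    [("prodA", 7.5), ("prodC", 3.0)],
    [("prodB", 2.5), ("prodC", 9.0)],
]

def local_aggregate(rows):
    # Runs on one "segment": aggregate only the local partition, no shared state.
    totals = {}
    for product, amount in rows:
        totals[product] = totals.get(product, 0.0) + amount
    return totals

def gather(partials):
    # "Master" step: merge the per-segment partial aggregates into the final answer.
    merged = {}
    for part in partials:
        for product, amount in part.items():
            merged[product] = merged.get(product, 0.0) + amount
    return merged

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as workers:
        partials = workers.map(local_aggregate, PARTITIONS)   # scatter
    print(gather(partials))                                   # gather
    # expected: {'prodA': 17.5, 'prodB': 7.5, 'prodC': 12.0}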
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A big data storage optimization analysis system for a cloud platform, characterized in that it comprises the following steps:
S1, building a data integration platform: deployment is implemented by adopting the image-based ETL integration platform, and the data integration platform comprises the Kettle product and the image-based IPAS product;
S2, establishing a data warehouse: the open-source GreenPlum cluster is used as the bottom database, combined with a schema data warehouse solution;
S3, deploying a BI data analysis platform: deployment is implemented by adopting the image AG product;
and S4, building a system bottom framework: the view SEA2 enterprise computing platform is used as the system bottom framework.
2. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: according to what is proposed in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp.
3. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: as set forth in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables;
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses.
4. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management;
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management.
5. The big data storage optimization analysis system for the cloud platform according to claim 2, wherein: the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access;
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
6. The big data storage optimization analysis system for the cloud platform according to claim 3, wherein: the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
CN202210162512.2A 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system Active CN114546415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210162512.2A CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210162512.2A CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Publications (2)

Publication Number Publication Date
CN114546415A true CN114546415A (en) 2022-05-27
CN114546415B CN114546415B (en) 2024-07-09

Family

ID=81677311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210162512.2A Active CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Country Status (1)

Country Link
CN (1) CN114546415B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200012659A1 (en) * 2018-07-06 2020-01-09 Snowflake Inc. Data replication and data failover in database systems
CN109189764A (en) * 2018-09-20 2019-01-11 北京桃花岛信息技术有限公司 A kind of colleges and universities' data warehouse layered design method based on Hive
US20210173846A1 (en) * 2019-05-06 2021-06-10 Oracle International Corporation System and method for automatic generation of bi models using data introspection and curation
CN112632025A (en) * 2020-08-25 2021-04-09 南方电网科学研究院有限责任公司 Power grid enterprise management decision support application system based on PAAS platform
CN112199164A (en) * 2020-10-19 2021-01-08 国网新疆电力有限公司信息通信公司 Method for ensuring container mirror image consistency
CN113569278A (en) * 2021-06-25 2021-10-29 华能招标有限公司 Data sharing method and related equipment of multi-bidding platform based on block chain

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI, HONGMING; JIANG, ZUHAI; JIANG, LIHONG: "Data storage and access framework for business models in a distributed environment", Journal of Tsinghua University (Science and Technology), no. 06, 15 June 2017 (2017-06-15) *
ZHAO, YI: "Research and practice on building a data warehouse based on a big data platform", China Financial Computer, no. 05, 7 May 2017 (2017-05-07) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329015A (en) * 2022-10-14 2022-11-11 中孚安全技术有限公司 Data warehouse system with hybrid architecture and implementation method

Also Published As

Publication number Publication date
CN114546415B (en) 2024-07-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant