CN114546415A - Big data storage optimization analysis system for cloud platform - Google Patents

Big data storage optimization analysis system for cloud platform

Info

Publication number
CN114546415A
Authority
CN
China
Prior art keywords
data
layer
platform
business
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210162512.2A
Other languages
Chinese (zh)
Other versions
CN114546415B (en)
Inventor
袁建
周子岩
赵可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaneng Tendering Co ltd
Original Assignee
Huaneng Tendering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaneng Tendering Co ltd filed Critical Huaneng Tendering Co ltd
Priority to CN202210162512.2A priority Critical patent/CN114546415B/en
Publication of CN114546415A publication Critical patent/CN114546415A/en
Application granted granted Critical
Publication of CN114546415B publication Critical patent/CN114546415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/252Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/283Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44505Configuring for program initiating, e.g. using registry, configuration files
    • G06F9/4451User profiles; Roaming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of cloud platform big data and discloses a big data storage optimization analysis system for a cloud platform, which comprises: building a data integration platform; establishing a data warehouse; deploying a BI data analysis platform; and building a system bottom framework. Unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB (Greenplum database) ensures data consistency by using multi-version concurrency control, which means that when the database is queried each transaction sees only a snapshot of the data; this guarantees that the current transaction does not see modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database.

Description

Big data storage optimization analysis system for cloud platform
Technical Field
The invention relates to the technical field of cloud platform big data, in particular to a big data storage optimization analysis system for a cloud platform.
Background
MVCC stands for MultiVersion Concurrency Control, a multi-version concurrency control technique. Its principle is to achieve concurrency control in a database by managing multiple versions of each data row, which, simply put, means storing historical versions of the data.
However, when concurrency control relies on a lock mechanism, reads and writes block each other, which reduces the concurrent data processing capability: the lock taken by a query (read) conflicts with the lock taken by a write, and the probability of deadlock increases.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a big data storage optimization analysis system for a cloud platform, so as to solve the problems set forth in the background art.
(II) technical scheme
In order to achieve the purpose, the invention provides the following technical scheme: a big data storage optimization analysis system for a cloud platform comprises the following steps:
s1, building a data integration platform for implementing deployment by adopting an image-based ETL integration platform, wherein the data integration platform comprises a Kettle product and an image-based IPAS product;
s2, establishing a data warehouse for using an open source GreenPlum cluster as a bottom database and combining with a schema data warehouse solution;
s3, deploying a BI data analysis platform, and implementing deployment by adopting an image AG product;
and S4, building a system bottom framework, and using the view SEA2 enterprise computing platform as the system bottom framework.
Preferably, according to what is proposed in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp (a minimal sketch of this timestamp-driven incremental extraction is given after this list).
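The following Python sketch illustrates the timestamp-driven incremental extraction described in the seventh item. It is a minimal, hypothetical example: the table and column names (etl_task_pool, src_orders, ods_orders, last_ts) and the connection strings are assumptions rather than part of the original disclosure, and psycopg2 is used only because GreenPlum speaks the PostgreSQL wire protocol.

import psycopg2

def extract_increment(task_name, src_dsn, dw_dsn):
    # Hypothetical sketch: pull only rows changed since the task's last watermark.
    dw = psycopg2.connect(dw_dsn)    # warehouse holding the task pool (watermarks)
    src = psycopg2.connect(src_dsn)  # source business system
    try:
        with dw.cursor() as cur:
            # 1. read the watermark timestamp recorded for this task
            cur.execute("SELECT last_ts FROM etl_task_pool WHERE task_name = %s",
                        (task_name,))
            last_ts = cur.fetchone()[0]
        with src.cursor() as cur:
            # 2. extract only the increment: rows changed after the watermark
            cur.execute("SELECT id, amount, updated_at FROM src_orders "
                        "WHERE updated_at > %s", (last_ts,))
            rows = cur.fetchall()
        new_ts = max(r[2] for r in rows) if rows else last_ts
        with dw.cursor() as cur:
            # 3. load into the ODS layer and advance the watermark in one transaction
            cur.executemany("INSERT INTO ods_orders (id, amount, updated_at) "
                            "VALUES (%s, %s, %s)", rows)
            cur.execute("UPDATE etl_task_pool SET last_ts = %s WHERE task_name = %s",
                        (new_ts, task_name))
        dw.commit()
    finally:
        src.close()
        dw.close()

Advancing the watermark to the maximum extracted updated_at, rather than the wall clock, keeps the next run from skipping rows committed while this one was executing.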
preferably, according to the proposal in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables (a minimal layered-loading sketch follows this list);
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses.
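As a rough illustration of the ODS, DW and DM layering referenced in the fifth type, the sketch below derives each layer only from the layer beneath it. The schema names mirror the layer names used here, but the tables and columns (ods.sales_orders, dw.dim_product, dw.fact_sales, dm.sales_by_product) are invented for the example and are not part of the original disclosure; the SQL is issued through psycopg2 since GreenPlum is PostgreSQL-compatible.

import psycopg2

# Hypothetical layered loading: ODS keeps the source structure, the DW layer splits
# facts from dimensions, and the DM layer pre-aggregates for fast BI queries.
LAYERED_SQL = [
    # DW layer: model ODS data into one dimension table and one fact table
    """INSERT INTO dw.dim_product (product_id, product_name, category)
       SELECT DISTINCT product_id, product_name, category FROM ods.sales_orders""",
    """INSERT INTO dw.fact_sales (order_id, product_id, quantity, amount, order_date)
       SELECT order_id, product_id, quantity, amount, order_date FROM ods.sales_orders""",
    # DM layer: compress granularity so dashboards hit a small aggregated table
    """INSERT INTO dm.sales_by_product (product_id, order_month, total_amount)
       SELECT product_id, date_trunc('month', order_date), sum(amount)
       FROM dw.fact_sales GROUP BY product_id, date_trunc('month', order_date)""",
]

def refresh_layers(dsn):
    with psycopg2.connect(dsn) as conn:      # one transaction for the whole refresh
        with conn.cursor() as cur:
            for stmt in LAYERED_SQL:
                cur.execute(stmt)            # each layer reads only the layer below
    conn.close()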
preferably, according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management (a minimal connection-pool sketch follows this list);
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management.
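A minimal sketch of the data-source management and connection pooling described in the first part above. The data source names, DSNs and pool sizes are illustrative assumptions; psycopg2's built-in pool is used for the PostgreSQL/GreenPlum case.

from psycopg2 import pool

# Hypothetical data-source registry: everything needed to establish a connection is
# stored under a data source name, and each source gets its own connection pool.
DATA_SOURCES = {
    "warehouse_gp": "dbname=dw user=bi password=secret host=gp-master port=5432",
    "ods_mirror": "dbname=ods user=bi password=secret host=gp-master port=5432",
}
POOLS = {name: pool.SimpleConnectionPool(1, 5, dsn) for name, dsn in DATA_SOURCES.items()}

def run_query(source_name, sql, params=None):
    # Borrow a pooled connection for the named data source and run one query.
    p = POOLS[source_name]
    conn = p.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        p.putconn(conn)    # return the connection to the pool instead of closing it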
Preferably, the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access (a hypothetical configuration sketch follows this list);
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
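Because the IPAS configuration model is proprietary and not detailed in this document, the following sketch is purely illustrative: it shows how a message input port, a command mediator with its commands, and a shared parameter set might be declared as plain data and dispatched. Every name and field in it is an assumption, not the product's actual interface.

# Hypothetical IPAS-style flow: a PORT receives a message, a mediator orchestrates
# its commands in order, and a shared parameter set supplies values common to
# several APIs. Illustrative only.
SHARED_PARAMS = {"sap_host": "sap.example.internal", "sap_client": "100"}

PORT_CONFIG = {
    "name": "order_inbound",
    "protocol": "HTTP",          # transmission protocol
    "format": "JSON",            # data format
    "dictionary": "order_v1",    # field dictionary used for validation
    "mediator": "order_flow",    # business process that handles this port
}

COMMAND_MEDIATOR = {
    "name": "order_flow",
    "commands": [                # executed and coordinated in this order
        {"type": "json_mapping", "mapping_file": "order_to_idoc.json"},
        {"type": "sap_idoc_import", "params_ref": "SHARED_PARAMS"},
        {"type": "mail_send", "to": "ops@example.com", "subject": "order imported"},
    ],
}

def run_flow(message, port=PORT_CONFIG, mediator=COMMAND_MEDIATOR):
    # Toy dispatcher: validate the port/mediator pairing, then "execute" each command.
    assert port["mediator"] == mediator["name"]
    for cmd in mediator["commands"]:
        params = SHARED_PARAMS if cmd.get("params_ref") == "SHARED_PARAMS" else {}
        print("executing", cmd["type"], "with", params or cmd)

run_flow({"order_id": 42, "amount": 99.0})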
Preferably, the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
(III) advantageous effects
Compared with the prior art, the invention provides a big data storage optimization analysis system for a cloud platform, which has the following beneficial effects:
the system adopts the open-source GreenPlum distributed database as the bottom-layer storage and computing engine; each processing unit has its own private CPU, memory and hard disk, there are no shared resources, and the processing units communicate with one another by protocol, giving better parallel processing and expansion capabilities; each node is independent and processes its own data, and the processed results can be summarized to the upper layer or exchanged among the nodes; the Share-Nothing architecture has obvious advantages in scalability and cost;
a massively parallel processing system consists of many loosely coupled processing units; relying on this high-performance MPP (Massively Parallel Processing) architecture, Greenplum can decompose TB-scale data warehouse loads and use all system resources to process a single query in parallel;
unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB ensures data consistency by using multi-version concurrency control (MultiVersion Concurrency Control, MVCC), which means that when the database is queried each transaction sees only a snapshot of the data; this ensures that the current transaction does not see the modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database; the greatest advantage of using MVCC instead of a lock mechanism for concurrency control is that with MVCC the lock taken by a query (read) does not conflict with the lock taken by a write, and reads and writes do not block each other.
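To make the snapshot behaviour concrete, here is a toy in-memory illustration of MVCC-style snapshot isolation. It is a conceptual sketch only, not GreenPlum's actual implementation: every write appends a new row version tagged with a commit id, and a reading transaction only sees versions committed before its snapshot was taken.

class MVCCStore:
    # Toy multi-version store: readers see a snapshot, writers append new versions.
    def __init__(self):
        self.versions = {}      # key -> list of (commit_id, value)
        self.next_id = 1

    def begin(self):
        # A transaction's snapshot is the commit horizon at the moment it starts.
        txid = self.next_id
        self.next_id += 1
        return {"snapshot": txid, "writes": {}}

    def read(self, tx, key):
        # Return the newest version committed before this transaction began.
        visible = [v for cid, v in self.versions.get(key, []) if cid < tx["snapshot"]]
        return visible[-1] if visible else None

    def write(self, tx, key, value):
        tx["writes"][key] = value           # buffered, invisible to other transactions

    def commit(self, tx):
        commit_id = self.next_id            # commit order defines visibility
        self.next_id += 1
        for key, value in tx["writes"].items():
            self.versions.setdefault(key, []).append((commit_id, value))

store = MVCCStore()
setup = store.begin()
store.write(setup, "balance", 100)
store.commit(setup)

reader = store.begin()                      # snapshot taken here
writer = store.begin()
store.write(writer, "balance", 50)
store.commit(writer)                        # committed after the reader's snapshot

print(store.read(reader, "balance"))        # 100 -- reader keeps its snapshot
print(store.read(store.begin(), "balance")) # 50  -- a new transaction sees the update

Because the reader consults an older version instead of taking a lock, it is never blocked by the concurrent writer, which is exactly the read/write behaviour claimed above.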
Drawings
FIG. 1 is a flow chart of data extraction according to the present invention;
FIG. 2 is a flow chart of task scheduling of the present invention;
FIG. 3 is a schematic view of a data warehouse hierarchy in accordance with the present invention;
FIG. 4 is a flow chart of the functional architecture of the data analysis platform according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme and discloses a big data storage optimization analysis system for a cloud platform which, as shown in FIG. 1 to FIG. 4, comprises the following steps:
s1, building a data integration platform for implementing deployment by adopting an image-based ETL integration platform, wherein the data integration platform comprises a Kettle product and an image-based IPAS product;
s2, establishing a data warehouse for using an open source GreenPlum cluster as a bottom database and combining with a schema data warehouse solution;
s3, deploying a BI data analysis platform, and implementing deployment by adopting an image AG product;
and S4, building a system bottom framework, and using the view SEA2 enterprise computing platform as the system bottom framework.
Preferably, according to what is set forth in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp;
preferably, according to the proposal in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables;
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses;
preferably, according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management;
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management;
Preferably, the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access;
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
Preferably, the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
The working principle of the system is as follows: the system adopts the open-source GreenPlum distributed database as the bottom-layer storage and computing engine. GreenPlum is an open-source big data platform built on a distributed database architecture; it adopts a share-nothing MPP architecture, has good linear scalability, and offers efficient parallel computation and parallel storage. It has a unique and efficient ORCA optimizer, is compatible with SQL syntax, and is suitable for efficient storage, processing and real-time analysis of PB-scale data. Because its kernel is based on the PostgreSQL database, it also supports mixed workloads covering OLTP-type business, and its backup nodes provide high availability for the database. Compared with Hadoop, it is better suited as the storage, computation and analysis engine for structured big data. When used with this system it has the following characteristics:
Shared-Nothing: each processing unit has its own private CPU, memory and hard disk, there are no shared resources, and the processing units communicate with one another by protocol, giving better parallel processing and expansion capabilities; each node is independent and processes its own data, and the processed results can be summarized to the upper layer or exchanged among the nodes; the Share-Nothing architecture has obvious advantages in scalability and cost;
MPP: a massively parallel processing system consists of many loosely coupled processing units; relying on this high-performance MPP (Massively Parallel Processing) architecture, GreenPlum can decompose TB-scale data warehouse loads and use all system resources to process a single query in parallel (a minimal scatter-gather sketch is given after these characteristics);
MVCC: unlike transactional database systems that control concurrent access through a lock mechanism, the GPDB guarantees data consistency by using multi-version concurrency control (MultiVersion Concurrency Control, MVCC), which means that when the database is queried each transaction sees only a snapshot of the data; this ensures that the current transaction does not see the modifications made by other transactions to the same record, thereby providing transaction isolation for every transaction in the database; the greatest advantage of using MVCC instead of a lock mechanism for concurrency control is that the lock taken by a query (read) does not conflict with the lock taken by a write, and reads and writes do not block each other.
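As a rough illustration of the MPP behaviour described above, the sketch below scatters an aggregation over several worker processes, each owning its own share-nothing partition of the data, and gathers the partial results at the master. The partitions and products are invented for the example and are not tied to GreenPlum's actual executor.

from multiprocessing import Pool

PARTITIONS = [                       # each segment's private slice of the fact data
    [("prodA", 10.0), ("prodB", 5.0)],
    [("prodA", 7.5), ("prodC", 3.0)],
    [("prodB", 2.5), ("prodC", 9.0)],
]

def local_aggregate(rows):
    # Runs on one "segment": aggregate only the local partition, no shared state.
    totals = {}
    for product, amount in rows:
        totals[product] = totals.get(product, 0.0) + amount
    return totals

def gather(partials):
    # "Master" step: merge the per-segment partial aggregates into the final answer.
    merged = {}
    for part in partials:
        for product, amount in part.items():
            merged[product] = merged.get(product, 0.0) + amount
    return merged

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as workers:
        partials = workers.map(local_aggregate, PARTITIONS)   # scatter
    print(gather(partials))                                   # gather
    # expected: {'prodA': 17.5, 'prodB': 7.5, 'prodC': 12.0}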
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A big data storage optimization analysis system for a cloud platform, characterized in that it comprises the following steps:
S1, building a data integration platform: deployment is implemented by adopting the image-based ETL integration platform, and the data integration platform comprises the Kettle product and the image-based IPAS product;
S2, establishing a data warehouse: the open-source GreenPlum cluster is used as the bottom database, combined with a schema data warehouse solution;
S3, deploying a BI data analysis platform: deployment is implemented by adopting the image AG product;
and S4, building a system bottom framework: the view SEA2 enterprise computing platform is used as the system bottom framework.
2. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: according to what is proposed in step S1, seven items are included, specifically as follows:
the first item: the data integration platform construction target, which is to integrate scattered business data, break down data islands, and store data in a unified way;
the second item: functional realization of the data integration platform; the platform is built by adopting the image-based ETL integration platform, which highly integrates the KETTLE and image-based IPAS products; the platform supports three modes of connecting to production systems, namely JDBC database views, API interfaces and FILE transfer, and performs data connectivity verification; non-real-time batch access is achieved by connecting to relational data through KETTLE, real-time data access is achieved by connecting IPAS to API interfaces, and incremental or full data extraction can be freely selected when developing a script, according to the volume and nature of the data;
the third item: unified data integration operation; data integration work is completed within the data integration platform, and no extra program needs to be written for data extraction; the platform has a set of data integration standards and specifications, so a client can complete part of the data integration work independently after some training, and most of the work is graphical, low-code operation;
the fourth item: data source management; the platform provides a complete data source management function, all information for establishing a database connection is stored in a data source, and a user obtains the corresponding database connection by providing the correct data source name;
the fifth item: data extraction; the platform has abundant data extraction components that comprehensively cover the cleaning, transformation and loading processes, and a user can flexibly combine these components to complete the data extraction work;
the sixth item: data service; the platform provides data services to business systems through modes such as IPAS providing an API interface, the database providing a JDBC data view, and KETTLE providing data files;
the seventh item: efficient scheduling; after the corresponding task scripts are developed, the platform automatically performs data collection, data acquisition, data processing and data analysis according to the dependency relationships among tasks; the platform calls the transformation script in combination with a task pool module, so that the corresponding timestamp is obtained from the task pool before a task is executed and the source data are extracted according to that timestamp.
3. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: as set forth in step S2, the following seven categories are included:
the first type: the data warehouse construction target; a data warehouse and business data subjects are established by building a multi-node GreenPlum distributed high-availability database, laying a foundation for cross-domain analysis and BI analysis;
the second type: the data warehouse hierarchy; the GreenPlum high-availability cluster is used as the data storage of the data warehouse bottom layer, and the data warehouse is constructed into three layers, an ODS layer, a DW layer and a DM layer, by adopting a hybrid data warehouse layering method;
wherein:
ODS layer (Operational Data Store):
the ODS layer mainly stores production system data; the original structure is kept unchanged as a whole, and part of the redundant data can be removed;
DW layer (Data Warehouse):
the DW layer further processes the ODS layer data, dividing it into dimension data and fact data through data modeling while keeping the granularity roughly consistent with that of the ODS layer;
DM layer (Data Mart):
the DM layer further abstracts and refines the DW layer data, strengthening the relations between data and compressing granularity and data volume, so as to improve the response speed of the system and reduce the system load;
the third type: bottom database construction; compared with a traditional single-instance database or a master-slave database, the GreenPlum high-availability cluster has obvious advantages in capacity, scalability, security and response speed, so the GreenPlum high-availability cluster is used as the bottom database of the data warehouse;
the fourth type: business analysis; business analysis is an important link in building the data warehouse and determines whether the data in the data warehouse will later meet the requirements of the enterprise;
the fifth type: reasonable data layering; the data warehouse is constructed by a hybrid layering architecture method that combines the CIF architecture with the MD (multidimensional) architecture and follows the basic architectural principles of loose coupling and layering; the basic idea is that the overall structure is CIF, divided into the ODS layer, the DW layer and the DM layer, while the DW layer adopts the MD structure and is built from fact tables and dimension tables;
the sixth type: the ODS layer converges business system data; the ODS layer is the layer closest to the data in the data sources, and data in the sources are extracted, cleaned and transmitted, and then loaded into this layer;
at this layer the data warehouse ingests, incrementally or in full, data from business systems such as SAP (ERP), DMS (distributor management system), WMS (warehouse management system), OA (office automation system), the expense control system, EHR (human resources system), EAGLE (customer management system), DDI (flow-direction interface data), LIMS (laboratory information management system) and FONE (financial budgeting), providing support for subsequent BI analysis and data services;
the seventh type: DW layer analysis dimensions; various data models are established by subject from the data obtained from the ODS layer, and at this layer the data warehouse covers the analysis dimensions of the enterprise business, including but not limited to accounting subjects, cost centers, projects and WBS master data, distributors, products, materials, organizations, posts, employees, hospitals, DTP pharmacies, doctors, speakers, suppliers, customers, channels, jurisdictions, hospitals and warehouses.
4. The big data storage optimization analysis system for the cloud platform according to claim 1, wherein: according to what is proposed in step S3, the following are included:
1) the BI data analysis platform construction target: by establishing and implementing a BI data warehouse and master data, the business data of the enterprise are combed through and scattered data islands are broken down to form structured data assets, thereby supporting the business transformation of the enterprise and realizing the enterprise strategy;
2) the BI data analysis platform functional architecture: the BI data analysis platform is built using the image AG product, and the platform is divided into five parts, specifically as follows:
data source management, which includes multi-database support and connection pool management;
data set management, which includes dynamic SQL, drag-and-drop support, result preview and variable substitution;
component management, which includes drag-and-drop development, rich components, secondary indicator calculation, custom indicators and style modification;
dashboard management, which includes linkage, jumping, drill-down, filter configuration, free component layout and result preview;
and system management, which includes role management, department management, user management and menu management.
5. The big data storage optimization analysis system for the cloud platform according to claim 2, wherein: the data service set forth in the sixth item includes the following categories:
first, message input port (PORT) configuration: the PORT configuration is the core configuration of an IPAS interface; it defines the correspondence among transmission protocol, data format, dictionary and business process, and manages services, such as Web services, configured for external applications to access;
second, message execution orchestration (Command Mediator) configuration: the execution of multiple Commands is completed and coordinated through the Command Mediator configuration, so that the business data are processed correctly;
third, message processing command (Command) management: a Command is a processing step for business data and comes in several business command types; a specific business processing command is selected and the parameters it requires are set, so that IPAS realizes the user's business goal; currently supported command types include SAP IDOC file import, SAP function access, mail sending, third-party Web API access, MS Dynamics GP service access, QuickBooks service access, MySQL, Oracle and PostgreSQL database service access, and remote server file upload and download service access;
fourth, file monitoring: configuring file monitoring and FTP file monitoring lets IPAS automatically process business data generated by the system, such as IDOC files generated by SAP and business data uploaded by users to the FTP server;
fifth, timing: timing management accesses third-party services by triggering timed tasks;
sixth, parameter sets: a parameter set is a service for sharing Command parameter content that serves the message input port (PORT) configuration; because the Command parameters of multiple APIs are mostly the same and only a few parameters differ between APIs, the shared parameter content only needs to be configured once in a common parameter set that the APIs reference, while the part unique to each API continues to be configured in that API's own parameter list;
seventh, file upload: a service for uploading Mapping files;
eighth, log: used for querying information related to the IPAS interfaces;
ninth, mapping tool: a tool for generating JSON-to-JSON mapping files.
6. The big data storage optimization analysis system for the cloud platform according to claim 3, wherein: the business analysis proposed in its sixth category comprises the following analysis items:
first, understand the indicators, dimensions and business meanings:
based on an understanding of the requirement documents, learn the relevant indicators, dimensions and business meanings, and obtain information such as the definition of each indicator, its calculation formula, its dimensions, the data display form, whether the graphical display supports drill-down, the detail information displayed, and the business module to which the graphical display belongs;
second, define the business scope:
define the business scope, and identify the related systems and modules from the requirements and from system research;
third, investigate the data sources of the business systems:
obtain the data sources of the business systems, determine the interfacing mode, the data structure and the data dictionary of the interfaced data, and analyze the data in combination with the business.
CN202210162512.2A 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system Active CN114546415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210162512.2A CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210162512.2A CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Publications (2)

Publication Number Publication Date
CN114546415A true CN114546415A (en) 2022-05-27
CN114546415B CN114546415B (en) 2024-07-09

Family

ID=81677311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210162512.2A Active CN114546415B (en) 2022-02-22 2022-02-22 Cloud platform big data storage optimization analysis system

Country Status (1)

Country Link
CN (1) CN114546415B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200012659A1 (en) * 2018-07-06 2020-01-09 Snowflake Inc. Data replication and data failover in database systems
CN109189764A (en) * 2018-09-20 2019-01-11 北京桃花岛信息技术有限公司 A kind of colleges and universities' data warehouse layered design method based on Hive
US20210173846A1 (en) * 2019-05-06 2021-06-10 Oracle International Corporation System and method for automatic generation of bi models using data introspection and curation
CN112632025A (en) * 2020-08-25 2021-04-09 南方电网科学研究院有限责任公司 Power grid enterprise management decision support application system based on PAAS platform
CN112199164A (en) * 2020-10-19 2021-01-08 国网新疆电力有限公司信息通信公司 Method for ensuring container mirror image consistency
CN113569278A (en) * 2021-06-25 2021-10-29 华能招标有限公司 Data sharing method and related equipment of multi-bidding platform based on block chain

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI, HONGMING; JIANG, ZUHAI; JIANG, LIHONG: "Data storage and access framework for business models in a distributed environment", Journal of Tsinghua University (Science and Technology), no. 06, 15 June 2017 (2017-06-15) *
ZHAO, YI: "Research and practice on building a data warehouse based on a big data platform", China Financial Computer, no. 05, 7 May 2017 (2017-05-07) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329015A (en) * 2022-10-14 2022-11-11 中孚安全技术有限公司 Data warehouse system with hybrid architecture and implementation method

Also Published As

Publication number Publication date
CN114546415B (en) 2024-07-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant