CN113806454B - Method and system for solving data acquisition of data warehouse accounting daily cut test - Google Patents
Method and system for solving data acquisition of data warehouse accounting daily cut test
- Publication number
- CN113806454B
- Authority
- CN
- China
- Prior art keywords
- data
- date
- date parameter
- parameter
- day
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of data processing, and in particular relates to a method and a system for data collection in data warehouse accounting daily cut tests. It addresses the prior-art problem that data collected under multiple daily cuts in one day, or multiple daily cuts across several days, contains duplicates. The method comprises: setting a first date parameter and a second date parameter in the system of a test environment, and creating the storage space and the corresponding storage structure; collecting configuration information; and determining whether the data type is incremental or full and performing the corresponding operation. The system provided by the invention comprises an acquisition module, a storage module and a deduplication module. The invention can effectively identify the real incremental data and facilitates subsequent data verification, in particular accounting verification.
Description
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a method and a system for data collection in data warehouse accounting daily cut tests.
Background
In mainstream commercial banks, business systems interact with the data warehouse mainly by unloading files. Internet banks instead actively collect business system data using the big data technology Sqoop MapReduce. In the mainstream approach, the systems are decoupled through file exchange: the business system lands the data as files, the data warehouse fetches the data files, and the data is then loaded into the data warehouse. Overall, this is somewhat slower than collecting data directly with big data technology. Collecting data with big data technology has its own problem, however. In the production environment, the business system performs one normal daily cut per day, so correct data can be collected. In the test environment of a mainstream commercial bank, the system time is changed together with the daily cut of the business system, which fully meets the test requirements of different business scenarios. In an internet bank, by contrast, microservices are deployed in containers, and one physical machine may host many containerized applications; if the machine's system time were modified along with the daily cut, the system time of the other applications would be affected. This patent addresses the following problem: when an internet bank's test environment goes through multiple daily cuts and the last update times in the tables all fall on the same day, how can the valid incremental data after each daily cut be identified?
In the construction of an internet bank data warehouse, there is no file exchange platform; Sqoop MapReduce is used to actively collect business system data, which is then loaded into the source layer. For incremental collection, last_update_time is generally used as the incremental condition in the WHERE clause that collects the data. When the business system is under test, however, every last_update_time value falls on the same day, so the data collected after each daily cut (other than the first) also contains the data of the previous daily cuts. Because last_update_time is on the same day, the real incremental data cannot be identified, which adds uncertainty to subsequent data verification, particularly accounting verification.
To obtain the real incremental data, a method and a system for data collection in data warehouse accounting daily cut tests are needed.
Disclosure of Invention
To address the prior-art problem that data collected under multiple daily cuts in one day, or multiple daily cuts across several days, contains duplicates, the invention provides a method and a system for data collection in data warehouse accounting daily cut tests, which can effectively identify the real incremental data and facilitate subsequent data verification, in particular accounting verification.
In one aspect, the invention provides a method for data collection in data warehouse accounting daily cut tests, comprising the following steps:
S1: setting a first date parameter and a second date parameter in the system of the test environment, and creating, in a storage module, the storage spaces and storage structures corresponding to the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter represents the storage area in the storage medium that holds the data collected after each daily cut;
S2: obtaining, by means of a source-system table-name parameter, the data collection configuration information used for the daily cut in the system of the test environment;
S3: performing the daily cut on the data in the database according to the data collection configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data collected this time from the number of second date parameters for the current natural day and the amount of data collected; if the data is full data, storing it in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the real incremental data, and storing the real incremental data in the directory corresponding to the system.
Preferably, step S1 further comprises:
setting a third date parameter in the system of the test environment, and creating, in the storage module, the storage space and storage structure corresponding to the third date parameter;
the third date parameter represents a time that belongs to the same daily cut as the previous one but falls on a different natural day.
Preferably, the third date parameter is an optional parameter.
Preferably, in step S3, the deduplication is performed as follows:
comparing the currently collected data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently collected data the records that also appear under the previous second date parameter, to obtain the real incremental data.
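The comparison described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: each record is assumed to be a hashable tuple, a record is assumed to count as repeated when an identical tuple appears in both collections, and the function name `real_increment` and the sample rows are hypothetical.

```python
from typing import Hashable, Iterable, List


def real_increment(current: Iterable[Hashable], previous: Iterable[Hashable]) -> List[Hashable]:
    """Return the records of `current` that do not appear in `previous`.

    Assumption: a record is "repeated" when it is identical in both
    collections. The order of `current` is preserved.
    """
    seen = set(previous)  # records stored under the previous second date parameter
    return [row for row in current if row not in seen]


# Hypothetical rows: (id, amount, last_update_time)
previous_cut = [(1, 100, "2021-09-24 09:00:00"), (2, 250, "2021-09-24 09:30:00")]
current_cut = previous_cut + [(3, 75, "2021-09-24 14:10:00")]

increment = real_increment(current_cut, previous_cut)
print(increment)
```

Under this whole-record comparison, a row whose values changed between the two collections is kept as part of the increment, because its tuple no longer matches the earlier one.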
In another aspect, the invention provides a system for data collection in data warehouse accounting daily cut tests, comprising:
an acquisition module, configured to collect configuration information, incremental data and full data;
a storage module, configured to store the daily cut data and create the corresponding storage areas;
a deduplication module, configured to remove the data that is identical across multiple daily cuts so as to obtain the real incremental data.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. The scheme as a whole addresses business systems deployed as containerized microservices, and is designed to meet the data warehouse's data testing needs under different daily cut scenarios;
2. With big data technology, collection using Sqoop MapReduce is markedly more efficient than the file-based exchange between business systems and the data warehouse used by mainstream commercial banks;
3. Daily cuts spanning multiple natural days are supported, with the data range determined by setting the optional parameter;
4. Data collection runs in a big data environment; the original incremental collection condition is separated from the partition directory in the storage module, so that for multiple cuts on the same day the collection condition stays unchanged while the data falls into different partitions in the storage module, and different partitions are loaded when the source-layer tables are loaded;
5. When source system data is collected into the storage module multiple times on the same natural day, the deduplication module quickly removes the data repeated between two daily cuts, leaving the real data increment.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
Fig. 1 is a flowchart of a method for resolving data collection of data warehouse accounting daily cut test according to an embodiment of the invention.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
In the description of the embodiments of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying a relative importance.
The present invention will be described in detail with reference to the accompanying drawings.
As shown in Fig. 1, a method for data collection in data warehouse accounting daily cut tests includes the following steps:
S1: setting a first date parameter and a second date parameter in the system of the test environment, and creating, in a storage module, the storage spaces and storage structures corresponding to the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter represents the storage area in the storage medium that holds the data collected after each daily cut;
S2: obtaining, by means of a source-system table-name parameter, the data collection configuration information used for the daily cut in the system of the test environment;
S3: performing the daily cut on the data in the database according to the data collection configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data collected this time from the number of second date parameters for the current natural day and the amount of data collected; if the data is full data, storing it in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the real incremental data, and storing the real incremental data in the directory corresponding to the system.
In this embodiment, the test environment runs in the Xcat.py program.
In this embodiment, the first date parameter is generally defined as a date. That date becomes the value of the ETL_DT field of the table in the source layer, and it is also used in the WHERE condition of the SQL statement that collects source system data (WHERE last_update_time >= first date parameter; last_update_time corresponds to the system time). The model layer runs in batches every day and reads its data set from the source layer filtered on ETL_DT. The source-system table-name parameter supplies the name of the source system table. The Xcat.py program reads the background configuration information and collects the source system data.
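As an illustration of the incremental condition, the sketch below runs the same kind of WHERE clause against an in-memory SQLite table. SQLite stands in for the source system purely for demonstration; the table name `src_account`, the `id` column, the sample rows and the variable `first_date_param` are assumptions and do not come from the patent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_account (id INTEGER, last_update_time TEXT)")
conn.executemany(
    "INSERT INTO src_account VALUES (?, ?)",
    [
        (1, "2021-09-23 18:00:00"),  # updated before the first date parameter
        (2, "2021-09-24 09:00:00"),
        (3, "2021-09-24 14:10:00"),
    ],
)

first_date_param = "2021-09-24"  # hypothetical value of the first date parameter

# ISO-8601 text sorts chronologically, so >= works on the TEXT column.
rows = conn.execute(
    "SELECT id, last_update_time FROM src_account "
    "WHERE last_update_time >= ? ORDER BY id",
    (first_date_param,),
).fetchall()
print(rows)
conn.close()
```

Every row updated on or after the first date parameter is returned each time the query runs, which is why a second run on the same day collects the earlier rows again.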
In this embodiment, the second date parameter is designated as the partition field for storage on HDFS, replacing the first date parameter that originally served as the HDFS partition field. The first date parameter is then used only in the WHERE condition for collecting from the source system (last_update_time >= first date parameter). With the second date parameter, when multiple batches are run on the same day, the data falls accurately into different partition locations on HDFS and can be stored in different ETL_DT partitions in Hive.
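The separation of the collection condition from the partition location can be sketched as below. This is a hypothetical illustration: the class `CollectionJob`, the base path `/warehouse/source`, the Hive-style `etl_dt=<value>` directory layout and the parameter values are all assumptions, since the text does not specify the path layout or the format of the second date parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionJob:
    table: str
    first_date_param: str   # lower bound for last_update_time
    second_date_param: str  # partition value, one per daily cut

    def where_clause(self) -> str:
        # The collection condition depends only on the first date parameter.
        return f"last_update_time >= '{self.first_date_param}'"

    def partition_path(self, base: str = "/warehouse/source") -> str:
        # Hive-style key=value directory; the layout is an assumption.
        return f"{base}/{self.table}/etl_dt={self.second_date_param}"


# Two batches on the same natural day (hypothetical parameter values).
first_cut = CollectionJob("src_account", "2021-09-24", "20210924")
second_cut = CollectionJob("src_account", "2021-09-24", "20210925")

print(first_cut.where_clause(), "->", first_cut.partition_path())
print(second_cut.where_clause(), "->", second_cut.partition_path())
```

Both jobs share one collection condition but write to different partitions, which matches the behaviour the paragraph attributes to the second date parameter.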
In this embodiment, a function that deduplicates the data of the two most recent partitions on HDFS is newly added to the Xcat program.
In another embodiment, step S1 further comprises:
setting a third date parameter in the system of the test environment, and creating, in the storage module, the storage space and storage structure corresponding to the third date parameter;
the third date parameter represents a time that belongs to the same daily cut as the previous one but falls on a different natural day.
In this embodiment, the tester does not finish creating the daily cut data on the first natural day, and continues testing and creating daily cut data on the second natural day. The complete data cannot be collected with the first date parameter alone, so a BETWEEN condition on the first date parameter AND the second date parameter is added to the incremental WHERE condition to meet the incremental collection requirement.
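A bounded incremental condition of this kind might be built as in the sketch below. It is an assumption-laden illustration: the function name `incremental_condition` is hypothetical, the bounds are assumed to be ISO-8601 date strings so that string comparison matches chronological order, and which date parameter supplies the upper bound is left to the caller.

```python
from typing import Optional, Tuple


def incremental_condition(lower: str, upper: Optional[str] = None) -> Tuple[str, tuple]:
    """Build a parameterized WHERE fragment on last_update_time.

    Without `upper`, only the lower bound is applied; with `upper`,
    a BETWEEN ... AND ... range is used instead.
    """
    if upper is None:
        return "last_update_time >= ?", (lower,)
    if upper < lower:  # valid only for ISO-8601 strings (assumption)
        raise ValueError("upper bound must not precede lower bound")
    return "last_update_time BETWEEN ? AND ?", (lower, upper)


single_day = incremental_condition("2021-09-24")
two_days = incremental_condition("2021-09-24", "2021-09-25")
print(single_day)
print(two_days)
```

Returning the SQL fragment and its bind values separately keeps the date values out of the SQL text, so the same fragment can be reused for each collection run.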
In another embodiment, the third date parameter is an optional parameter.
In this embodiment, the third date parameter is not limited to the second natural day, and the time can be freely selected according to the day-cut requirement.
In another embodiment, in step S3, the deduplication is performed as follows:
comparing the currently collected data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently collected data the records that also appear under the previous second date parameter, to obtain the real incremental data.
In this embodiment, two daily cuts are made on the same day. The first cut is made at time A, and with the first date parameter the collection covers the data from time 0 to time A. The second cut is made at time B, and with the first date parameter the collection covers the data from time 0 to time B. The second collection alone therefore cannot yield the real incremental data (the data from time A to time B); the real increment is identified only after the records repeated between the first and second collections are removed.
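The two-cut scenario can be walked through with a small simulation. The concrete times for A and B, the sample rows and the helper `collect` are hypothetical; the sketch only illustrates that the second collection contains the first and that removing the overlap leaves the rows between A and B.

```python
from datetime import datetime

DAY_START = datetime(2021, 9, 24, 0, 0)  # "time 0"
TIME_A = datetime(2021, 9, 24, 10, 0)    # first daily cut (hypothetical)
TIME_B = datetime(2021, 9, 24, 15, 0)    # second daily cut (hypothetical)

# Hypothetical source rows: (id, last_update_time)
source = [
    (1, datetime(2021, 9, 24, 8, 30)),
    (2, datetime(2021, 9, 24, 9, 45)),
    (3, datetime(2021, 9, 24, 11, 20)),
    (4, datetime(2021, 9, 24, 14, 5)),
]


def collect(rows, cut_time):
    """Rows present at cut_time whose last_update_time is on or after time 0."""
    return [r for r in rows if DAY_START <= r[1] <= cut_time]


first = collect(source, TIME_A)   # time 0 .. A
second = collect(source, TIME_B)  # time 0 .. B, still contains `first`

first_set = set(first)
increment = [r for r in second if r not in first_set]  # A .. B only

print([r[0] for r in first])
print([r[0] for r in second])
print([r[0] for r in increment])
```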
In another embodiment, a system for data collection in data warehouse accounting daily cut tests includes:
an acquisition module, configured to collect configuration information, incremental data and full data;
a storage module, configured to store the daily cut data and create the corresponding storage areas;
a deduplication module, configured to remove the data that is identical across multiple daily cuts so as to obtain the real incremental data.
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.
Claims (5)
1. A method for solving data acquisition of data warehouse accounting daily cut test, characterized in that the method comprises the following steps:
S1: setting a first date parameter and a second date parameter in the system of the test environment, and creating, in a storage module, the storage spaces and storage structures corresponding to the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter represents the storage area in the storage medium that holds the data collected after each daily cut;
S2: obtaining, by means of a source-system table-name parameter, the data collection configuration information used for the daily cut in the system of the test environment;
S3: performing the daily cut on the data in the database according to the data collection configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data collected this time from the number of second date parameters for the current natural day and the amount of data collected; if the data is full data, storing it in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the real incremental data, and storing the real incremental data in the directory corresponding to the system.
2. The method for resolving data warehouse accounting daily cut test data collection as claimed in claim 1, wherein step S1 further comprises:
setting a third date parameter in the system of the test environment, and creating, in the storage module, the storage space and storage structure corresponding to the third date parameter;
the third date parameter represents a time that belongs to the same daily cut as the previous one but falls on a different natural day.
3. A method of resolving data warehouse accounting daily cut test data collection as claimed in claim 2, wherein: the third date parameter is an optional parameter.
4. The method for resolving data warehouse accounting daily cut test data collection as claimed in claim 1, wherein in step S3 the deduplication is performed as follows:
comparing the currently collected data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently collected data the records that also appear under the previous second date parameter, to obtain the real incremental data.
5. A system for resolving data warehouse accounting daily cut test data collection using the method of any one of claims 1-4, characterized by comprising:
an acquisition module, configured to collect configuration information, incremental data and full data;
a storage module, configured to store the daily cut data and create the corresponding storage areas;
a deduplication module, configured to remove the data that is identical across multiple daily cuts so as to obtain the real incremental data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111125846.4A CN113806454B (en) | 2021-09-24 | 2021-09-24 | Method and system for solving data acquisition of data warehouse accounting daily cut test |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111125846.4A CN113806454B (en) | 2021-09-24 | 2021-09-24 | Method and system for solving data acquisition of data warehouse accounting daily cut test |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113806454A (en) | 2021-12-17 |
CN113806454B (en) | 2023-06-06 |
Family
ID=78896597
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111125846.4A Active CN113806454B (en) | 2021-09-24 | 2021-09-24 | Method and system for solving data acquisition of data warehouse accounting daily cut test |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113806454B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107908672A (en) * | 2017-10-24 | 2018-04-13 | 深圳前海微众银行股份有限公司 | Application report implementation method, equipment and storage medium based on Hadoop platform |
CN108920698A (en) * | 2018-07-16 | 2018-11-30 | 北京京东金融科技控股有限公司 | A kind of method of data synchronization, device, system, medium and electronic equipment |
CN109241189A (en) * | 2018-09-10 | 2019-01-18 | 阿里巴巴集团控股有限公司 | Storage method, device and the equipment of data |
CN111008078A (en) * | 2019-12-05 | 2020-04-14 | 深圳前海微众银行股份有限公司 | Batch processing method, device and equipment of data and computer storage medium |
CN111190868A (en) * | 2020-01-02 | 2020-05-22 | 中国建设银行股份有限公司 | File processing method and device |
WO2021057515A1 (en) * | 2019-09-27 | 2021-04-01 | 中兴通讯股份有限公司 | Date switching data offloading method, apparatus, device, and medium |
Non-Patent Citations (2)
Title |
---|
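Design and implementation of a Solr-based integrated retrieval framework for heterogeneous resources; 马凤娟; 吴鹏飞;; 现代情报 (No. 08); 135-137 * |
Design and implementation of uninterrupted service for cluster *** based on data synchronization; 陈东江; 《中国优秀硕士学位论文全文数据库信息科技辑》 (No. 05); I138-398 * |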
Also Published As
Publication number | Publication date |
---|---|
CN113806454A (en) | 2021-12-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108694195B (en) | Management method and system of distributed data warehouse | |
CN109344056B (en) | Test method and test device | |
CN110321383A (en) | Big data platform method of data synchronization, device, computer equipment and storage medium | |
CA3176450A1 (en) | Method and apparatus for implementing incremental data consistency | |
CN112256656B (en) | Transaction rollback method and device, database, system and computer storage medium | |
CN106682036A (en) | Data exchange system and exchange method thereof | |
US20030033291A1 (en) | SQL execution analysis | |
EP3251298A1 (en) | Data extraction | |
CN106155838B (en) | A kind of database back-up data restoration methods and device | |
CN111221570A (en) | Migration method, device, equipment and storage medium of microservice framework | |
CN113391943A (en) | Micro-service fault root cause positioning method and device based on cause and effect inference | |
CN108154289A (en) | A kind of product quality factor information automatic analysis system and automatic analysis method | |
CN112307124A (en) | Database synchronization verification method, device, equipment and storage medium | |
CN108647118A (en) | Copy abnormal restoring method, device and computer equipment based on storage cluster | |
CN111078719A (en) | Data recovery method and device, storage medium and processor | |
CN107590233B (en) | File management method and device | |
CN113806454B (en) | Method and system for solving data acquisition of data warehouse accounting daily cut test | |
CN110309206B (en) | Order information acquisition method and system | |
CN107831741A (en) | New time series data read method for power plant | |
CN113868283A (en) | Data testing method, device, equipment and computer storage medium | |
CN113742145A (en) | Method, system, equipment and storage medium for testing performance of solid state disk | |
CN115629950B (en) | Extraction method of performance test asynchronous request processing time point | |
CN113553320B (en) | Data quality monitoring method and device | |
CN115687599B (en) | Service data processing method and device, electronic equipment and storage medium | |
CN112765170B (en) | Embedded time sequence data management method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||