CN113806454B - Method and system for solving data acquisition of data warehouse accounting daily cut test - Google Patents

Info

Publication number
CN113806454B
Authority
CN
China
Prior art keywords
data
date
date parameter
parameter
day
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111125846.4A
Other languages
Chinese (zh)
Other versions
CN113806454A (en)
Inventor
田德惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN202111125846.4A
Publication of CN113806454A
Application granted
Publication of CN113806454B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing, and in particular relates to a method and a system for data warehouse data acquisition in accounting daily cut tests. It addresses the problem in the prior art that the data acquired after multiple daily cuts on one day, or after daily cuts spanning several days, contains duplicates. A first date parameter and a second date parameter are set in the system of a test environment, and a storage space and a corresponding storage structure are established; configuration information is collected, the acquired data is classified as incremental or full, and the corresponding operation is performed. The system provided by the invention comprises an acquisition module, a storage module and a deduplication module. The invention can effectively identify the true incremental data and simplifies subsequent data verification, in particular accounting verification.

Description

Method and system for solving data acquisition of data warehouse accounting daily cut test
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a method and a system for data warehouse data acquisition in accounting daily cut tests.
Background
In mainstream commercial banks, business systems interact with the data warehouse primarily by unloading files. Internet banks instead actively collect business system data using the big data technology Sqoop MapReduce. Mainstream commercial banking systems are decoupled through file interaction: the data is landed as files, the data warehouse actively fetches the data files, and the data is then loaded into the data warehouse. The overall operational efficiency of this approach is somewhat lower than collecting data directly through big data techniques. Collecting data with big data technology has a problem of its own, however. In the production environment the business system performs its daily cut once per natural day, so correct data can be acquired. In a test environment, when the business system of a mainstream commercial bank performs a daily cut, the system time is switched along with it, as the test scenarios require, so mainstream commercial banks can fully meet the test requirements of different business scenarios. In an internet bank, by contrast, microservices are deployed in containers and one physical machine may host many containerized applications; if the machine's system time were modified together with the business daily cut, the system time of the other applications would be affected. This patent addresses the resulting problem: in an internet bank that goes through multiple daily cuts, the last update time in the tables always falls on the same day, so the valid incremental data after each daily cut has to be identified in some other way.
In building an internet bank data warehouse there is no file exchange platform; Sqoop MapReduce is used to actively collect business system data, and the data is then loaded into the source layer. For incremental acquisition, last_update_time is generally used as the incremental condition in the WHERE clause that selects the data. When the business system is under test, however, every value of last_update_time falls on the same day, so the data acquired after each daily cut (except the first) also contains the data of the previous daily cuts. Because last_update_time is the same day throughout, the true incremental data cannot be identified, which adds uncertainty to subsequent data checks, in particular accounting checks.
To obtain the true incremental data, a method and a system for solving data acquisition of the data warehouse accounting daily cut test are therefore needed.
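The duplication described above can be reproduced in a few lines. The following is a minimal sketch, not the bank's system: the table name acct_txn, the timestamps and the in-memory SQLite database are all assumptions made for illustration. It pulls rows with a WHERE condition on last_update_time twice on the same natural day.

```python
import sqlite3

# Hypothetical source table; the real business system is not described in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acct_txn (id INTEGER PRIMARY KEY, last_update_time TEXT)")


def collect(since):
    """Incremental pull keyed only on last_update_time."""
    rows = conn.execute(
        "SELECT id FROM acct_txn WHERE last_update_time >= ? ORDER BY id",
        (since,),
    ).fetchall()
    return [row[0] for row in rows]


# Rows written before the first daily cut of the natural day 2021-09-24.
conn.executemany(
    "INSERT INTO acct_txn VALUES (?, ?)",
    [(1, "2021-09-24 09:00:00"), (2, "2021-09-24 10:30:00")],
)
first_pull = collect("2021-09-24 00:00:00")

# A row written between the first and second daily cut, same natural day.
conn.execute("INSERT INTO acct_txn VALUES (?, ?)", (3, "2021-09-24 14:00:00"))
second_pull = collect("2021-09-24 00:00:00")

print(first_pull)   # [1, 2]
print(second_pull)  # [1, 2, 3]
```

The second pull returns rows 1 and 2 again, because the condition cannot tell them apart from row 3; only row 3 is the true increment.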
Disclosure of Invention
To address the duplicated data that the prior art obtains after multiple daily cuts on one day or after daily cuts spanning several days, the invention provides a method and a system for solving data acquisition of the data warehouse accounting daily cut test, which can effectively identify the true incremental data and simplify subsequent data verification, in particular accounting verification.
In one aspect, the invention provides a method for solving data acquisition of the data warehouse accounting daily cut test, comprising the following steps:
S1, setting a first date parameter and a second date parameter in the system of a test environment, and establishing in a storage module the storage spaces and corresponding storage structures for the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter identifies the storage area in the storage medium that holds the data acquired after each daily cut;
S2, acquiring, in the system of the test environment, the data acquisition configuration information for the daily cut by means of a source system table name parameter;
S3, performing the daily cut on the data in the database according to the data acquisition configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data acquired this time from the number of second date parameters for the current natural day and the amount of data acquired; if the data is full data, storing the full data in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the true incremental data, and storing the true incremental data in the directory corresponding to the system.
Preferably, step S1 further comprises:
setting a third date parameter in the system of the test environment, and establishing in the storage module a storage space and a corresponding storage structure for the third date parameter;
the third date parameter indicates a time that belongs to the same day as the previous daily cut but falls on a different natural day from it.
Preferably, the third date parameter is an optional parameter.
Preferably, in step S3, the deduplication is performed as follows:
comparing the currently acquired data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently acquired data the records that also appear under the previous second date parameter, to obtain the true incremental data.
In another aspect, the invention also provides a system for solving data acquisition of the data warehouse accounting daily cut test, comprising:
an acquisition module, used for collecting configuration information, incremental data and full data;
a storage module, used for storing the daily cut data and creating the corresponding storage areas;
a deduplication module, used for removing the data shared by multiple sets of daily cut data to obtain the true incremental data.
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. the scheme as a whole addresses business systems deployed as containerized microservices, and is designed to meet the data warehouse's data testing requirements under different daily cut scenarios;
2. against the background of big data technology, collection with Sqoop MapReduce is markedly more efficient than the file-based interaction between business systems and the data warehouse used by mainstream commercial banks;
3. when daily cuts span several natural days, the data range can be determined by setting the optional parameter;
4. data acquisition runs in a big data environment, and the original incremental acquisition condition is separated from the partition directory on the storage module; with several daily cuts on the same day, the acquisition condition stays unchanged while the data falls into different partitions in the storage module, so that different partitions are loaded when the source layer tables are loaded;
5. source system data is collected and placed in the storage module several times on the same natural day, and the deduplication module quickly removes the data repeated between two successive daily cuts to form the true data increment.
Drawings
The invention will now be described by way of example and with reference to the accompanying drawings in which:
Fig. 1 is a flowchart of a method for solving data acquisition of the data warehouse accounting daily cut test according to an embodiment of the invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. The components of the embodiments, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of configurations. The following detailed description of the embodiments, as provided in the accompanying drawings, is therefore not intended to limit the scope of the application as claimed, but merely represents selected embodiments. All other embodiments that those skilled in the art can obtain from the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
In the description of the embodiments of the present application, the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not to be construed as indicating or implying a relative importance.
The present invention will be described in detail with reference to the accompanying drawings.
As shown in Fig. 1, a method for solving data acquisition of the data warehouse accounting daily cut test includes the following steps:
S1, setting a first date parameter and a second date parameter in the system of a test environment, and establishing in a storage module the storage spaces and corresponding storage structures for the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter identifies the storage area in the storage medium that holds the data acquired after each daily cut;
S2, acquiring, in the system of the test environment, the data acquisition configuration information for the daily cut by means of a source system table name parameter;
S3, performing the daily cut on the data in the database according to the data acquisition configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data acquired this time from the number of second date parameters for the current natural day and the amount of data acquired; if the data is full data, storing the full data in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the true incremental data, and storing the true incremental data in the directory corresponding to the system.
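The flow of step S3 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the text does not spell out the exact rule that separates full from incremental data, so the rule used here (the first second date parameter of a natural day, or an empty pull, is treated as full data) is a guess, and the function name process_cut, the record shape and the key format 2021-09-24_01 are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (id, payload): a hypothetical record shape


def process_cut(partitions: Dict[str, List[Row]], natural_day: str,
                second_date: str, rows: List[Row]) -> Tuple[str, List[Row]]:
    """Store one daily cut's rows and return (kind, rows to load).

    Assumed rule: the first second date parameter of a natural day, or an
    empty pull, is handled as full data; later ones are incremental.
    """
    # Second date parameters already stored for this natural day.
    earlier = sorted(k for k in partitions
                     if k.startswith(natural_day) and k != second_date)
    # Store the pulled rows under the second date parameter.
    partitions[second_date] = rows
    if not earlier or not rows:
        return "full", rows
    # Compare with the most recent earlier partition and drop repeated rows.
    previous = set(partitions[earlier[-1]])
    return "incremental", [row for row in rows if row not in previous]


store: Dict[str, List[Row]] = {}
kind1, out1 = process_cut(store, "2021-09-24", "2021-09-24_01",
                          [(1, "a"), (2, "b")])
kind2, out2 = process_cut(store, "2021-09-24", "2021-09-24_02",
                          [(1, "a"), (2, "b"), (3, "c")])
print(kind1, out1)  # full [(1, 'a'), (2, 'b')]
print(kind2, out2)  # incremental [(3, 'c')]
```

The dictionary stands in for the storage module; in the embodiment the partitions live on HDFS.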
In this embodiment, the test environment runs in the Xcat.py program.
In this embodiment, the first date parameter is generally defined as a date. It appears as the value of the field ETL_DT of the tables in the source layer, and it is used in the WHERE condition of the SQL statement that collects the source system data (WHERE last_update_time >= first date parameter), where last_update_time corresponds to the system time. The model layer runs its batch every day and obtains its data set from the source layer with ETL_DT as the condition. The source system table name parameter supplies the table name of the source system. The Xcat.py program reads the background configuration information and collects the source system data.
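The acquisition condition of this embodiment can be written as a small query builder. This is a sketch only: the internals of Xcat.py are not disclosed, so the function name build_incremental_query, the table name acct_txn and the use of a ? placeholder are assumptions.

```python
import re
from typing import Tuple

# Accept only plain identifiers, since the table name is spliced into the SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_incremental_query(source_table: str, first_date: str) -> Tuple[str, Tuple[str]]:
    """Build the incremental pull: last_update_time >= first date parameter."""
    if not _IDENTIFIER.match(source_table):
        raise ValueError(f"unexpected table name: {source_table!r}")
    sql = f"SELECT * FROM {source_table} WHERE last_update_time >= ?"
    return sql, (first_date,)


sql, params = build_incremental_query("acct_txn", "2021-09-24")
print(sql)     # SELECT * FROM acct_txn WHERE last_update_time >= ?
print(params)  # ('2021-09-24',)
```

Passing the date as a bound parameter rather than formatting it into the string keeps the condition identical from one daily cut to the next.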
In this embodiment, the second date parameter specifies the partition field under which the data is stored on HDFS, replacing the first date parameter in that role. The first date parameter is then used only as the bound in the WHERE condition applied when collecting from the source system (greater than or equal to the first date parameter). The second date parameter ensures that, when several batches run on the same day, the data lands in distinct partition locations on HDFS and can be stored under distinct ETL_DT partitions in Hive.
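The separation of acquisition condition and partition location might look like the sketch below. The base directory /warehouse/source, the table name acct_txn and the value format 2021-09-24_01 for the second date parameter are assumptions; only the idea that the partition is named by the second date parameter while the WHERE condition uses the first comes from the text.

```python
import posixpath


def partition_path(base_dir: str, table_name: str, second_date: str) -> str:
    """Return the directory for one batch, partitioned by the second date parameter."""
    return posixpath.join(base_dir, table_name, f"etl_dt={second_date}")


def plan_batch(table_name: str, first_date: str, second_date: str,
               base_dir: str = "/warehouse/source") -> dict:
    """Pair the unchanged acquisition condition with a per-batch target directory."""
    return {
        "where": f"last_update_time >= '{first_date}'",
        "target_dir": partition_path(base_dir, table_name, second_date),
    }


# Two batches on the same natural day: same condition, different partitions.
batch_1 = plan_batch("acct_txn", "2021-09-24", "2021-09-24_01")
batch_2 = plan_batch("acct_txn", "2021-09-24", "2021-09-24_02")
print(batch_1["target_dir"])  # /warehouse/source/acct_txn/etl_dt=2021-09-24_01
print(batch_2["target_dir"])  # /warehouse/source/acct_txn/etl_dt=2021-09-24_02
```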
In this embodiment, a function that deduplicates the data of the two most recent partitions on HDFS is newly added in the xcat.
In another embodiment, step S1 further comprises:
setting a third date parameter in the system of the test environment, and establishing in the storage module a storage space and a corresponding storage structure for the third date parameter;
the third date parameter indicates a time that belongs to the same day as the previous daily cut but falls on a different natural day from it.
In this embodiment, the tester does not finish producing the daily cut data on the first natural day, continues the test on the second natural day, and produces daily cut data there. The complete data cannot be acquired with the first date parameter alone, so a BETWEEN on the first date parameter AND the second date parameter is added to the incremental WHERE condition to meet the requirement of incremental acquisition.
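A range condition of this kind can be sketched as follows. The function takes both bounds as plain arguments and does not assume which date parameter supplies the upper bound; the name incremental_condition and the ISO date strings are assumptions made for illustration.

```python
from typing import Optional, Tuple


def incremental_condition(lower: str, upper: Optional[str] = None) -> Tuple[str, Tuple[str, ...]]:
    """Return a WHERE fragment: open-ended by default, BETWEEN when an upper bound is given."""
    if upper is None:
        return "last_update_time >= ?", (lower,)
    # ISO-formatted date strings sort in chronological order.
    if upper < lower:
        raise ValueError("upper bound precedes lower bound")
    return "last_update_time BETWEEN ? AND ?", (lower, upper)


single_day = incremental_condition("2021-09-24")
two_days = incremental_condition("2021-09-24", "2021-09-25")
print(single_day)  # ('last_update_time >= ?', ('2021-09-24',))
print(two_days)    # ('last_update_time BETWEEN ? AND ?', ('2021-09-24', '2021-09-25'))
```

Keeping the upper bound optional mirrors the statement that the extra parameter is only needed when a test spills over into another natural day.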
In another embodiment, the third date parameter is an optional parameter.
In this embodiment, the third date parameter is not limited to the second natural day; its time can be chosen freely according to the daily cut requirements.
In another embodiment, in step S3, the deduplication is performed as follows:
comparing the currently acquired data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently acquired data the records that also appear under the previous second date parameter, to obtain the true incremental data.
In this embodiment, two daily cuts are made on the same day. The first cut occurs at time A, and with the first date parameter the acquisition collects the data from time 0 to time A. The second cut occurs at time B, and with the first date parameter the acquisition collects the data from time 0 to time B. The second acquisition by itself therefore cannot yield the true incremental data (the data from time A to time B); that data is identified only after the records repeated between the first and second acquisitions are removed from the second.
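The two-cut example can be worked through with small in-memory lists. This is a sketch under assumptions: the record shape (id, last update time) and the concrete times are invented, and whole-record comparison is one plausible reading of "repeated data", not a rule stated in the text.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (transaction id, last update time): hypothetical shape


def true_increment(previous: List[Record], current: List[Record]) -> List[Record]:
    """Keep the records of the current pull that were not in the previous pull."""
    seen = set(previous)
    return [record for record in current if record not in seen]


# First cut at time A: the pull covers time 0 to time A.
first_cut = [("t1", "09:10"), ("t2", "09:55")]
# Second cut at time B: the pull covers time 0 to time B and repeats t1 and t2.
second_cut = [("t1", "09:10"), ("t2", "09:55"), ("t3", "11:20"), ("t4", "14:05")]

increment = true_increment(first_cut, second_cut)
print(increment)  # [('t3', '11:20'), ('t4', '14:05')]
```

Because whole records are compared, a row that existed at time A but was updated before time B carries a new last update time and is kept as part of the increment.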
In another embodiment, a system for solving data acquisition of the data warehouse accounting daily cut test includes:
an acquisition module, used for collecting configuration information, incremental data and full data;
a storage module, used for storing the daily cut data and creating the corresponding storage areas;
a deduplication module, used for removing the data shared by multiple sets of daily cut data to obtain the true incremental data.
The foregoing examples merely represent specific embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that, for those skilled in the art, several variations and modifications can be made without departing from the technical solution of the present application, which fall within the protection scope of the present application.

Claims (5)

1. A method for solving data acquisition of the data warehouse accounting daily cut test, characterized in that the method comprises the following steps:
S1, setting a first date parameter and a second date parameter in the system of a test environment, and establishing in a storage module the storage spaces and corresponding storage structures for the first date parameter and the second date parameter respectively;
the first date parameter represents the time of the first daily cut within a natural day, and the second date parameter identifies the storage area in the storage medium that holds the data acquired after each daily cut;
S2, acquiring, in the system of the test environment, the data acquisition configuration information for the daily cut by means of a source system table name parameter;
S3, performing the daily cut on the data in the database according to the data acquisition configuration information of step S2, and storing the daily cut data in the storage space of the corresponding second date parameter; determining the type of the data acquired this time from the number of second date parameters for the current natural day and the amount of data acquired; if the data is full data, storing the full data in the directory corresponding to the system; if the data is incremental data, comparing and deduplicating it to obtain the true incremental data, and storing the true incremental data in the directory corresponding to the system.
2. The method for solving data acquisition of the data warehouse accounting daily cut test according to claim 1, characterized in that step S1 further comprises:
setting a third date parameter in the system of the test environment, and establishing in the storage module a storage space and a corresponding storage structure for the third date parameter;
the third date parameter indicates a time that belongs to the same day as the previous daily cut but falls on a different natural day from it.
3. The method for solving data acquisition of the data warehouse accounting daily cut test according to claim 2, characterized in that the third date parameter is an optional parameter.
4. The method for solving data acquisition of the data warehouse accounting daily cut test according to claim 1, characterized in that, in step S3, the deduplication is performed as follows:
comparing the currently acquired data with the data in the storage space corresponding to the previous second date parameter, and deleting from the currently acquired data the records that also appear under the previous second date parameter, to obtain the true incremental data.
5. A system for solving data acquisition of the data warehouse accounting daily cut test using the method of any one of claims 1-4, characterized by comprising:
an acquisition module, used for collecting configuration information, incremental data and full data;
a storage module, used for storing the daily cut data and creating the corresponding storage areas;
a deduplication module, used for removing the data shared by multiple sets of daily cut data to obtain the true incremental data.
CN202111125846.4A 2021-09-24 2021-09-24 Method and system for solving data acquisition of data warehouse accounting daily cut test Active CN113806454B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111125846.4A CN113806454B (en) 2021-09-24 2021-09-24 Method and system for solving data acquisition of data warehouse accounting daily cut test

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111125846.4A CN113806454B (en) 2021-09-24 2021-09-24 Method and system for solving data acquisition of data warehouse accounting daily cut test

Publications (2)

Publication Number Publication Date
CN113806454A (en) 2021-12-17
CN113806454B (en) 2023-06-06

Family

ID=78896597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111125846.4A Active CN113806454B (en) 2021-09-24 2021-09-24 Method and system for solving data acquisition of data warehouse accounting daily cut test

Country Status (1)

Country Link
CN (1) CN113806454B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107908672A (en) * 2017-10-24 2018-04-13 深圳前海微众银行股份有限公司 Application report implementation method, equipment and storage medium based on Hadoop platform
CN108920698A (en) * 2018-07-16 2018-11-30 北京京东金融科技控股有限公司 A kind of method of data synchronization, device, system, medium and electronic equipment
CN109241189A (en) * 2018-09-10 2019-01-18 阿里巴巴集团控股有限公司 Storage method, device and the equipment of data
CN111008078A (en) * 2019-12-05 2020-04-14 深圳前海微众银行股份有限公司 Batch processing method, device and equipment of data and computer storage medium
CN111190868A (en) * 2020-01-02 2020-05-22 中国建设银行股份有限公司 File processing method and device
WO2021057515A1 (en) * 2019-09-27 2021-04-01 中兴通讯股份有限公司 Date switching data offloading method, apparatus, device, and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于Solr的异构资源集成检索框架设计与实现;马凤娟;吴鹏飞;;现代情报(第08期);135-137 *
基于数据同步的集群***不间断服务的设计与实现;陈东江;《中国优秀硕士学位论文全文数据库信息科技辑》(第05期);I138-398 *

Also Published As

Publication number Publication date
CN113806454A (en) 2021-12-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant