CN115965329A - Scientific and technological big data intelligent decision analysis method - Google Patents

Scientific and technological big data intelligent decision analysis method

Info

Publication number
CN115965329A
CN115965329A (application CN202211460302.8A)
Authority
CN
China
Prior art keywords
data
standard
steps
record
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211460302.8A
Other languages
Chinese (zh)
Inventor
上官学奎
马瑾男
王志文
倪凯明
宋可
张敬康
尉建兴
张颖
文豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi Information Industry Technology Research Institute Co ltd
Original Assignee
Shanxi Information Industry Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi Information Industry Technology Research Institute Co ltd
Priority to CN202211460302.8A
Publication of CN115965329A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of decision analysis methods, and specifically relates to a scientific and technological big data intelligent decision analysis method comprising the following steps: S1, collecting data from the whole platform; S2, governing the collected platform data; and S3, integrating the data. The invention collects data from an intermediate library or a front-end processor database, generates data packets according to a defined data exchange schema, and sends the packets to a receiver over a customized sending route. On receiving a packet, the receiver automatically unpacks it and stores the data in its own front-end processor exchange database. The collected data, comprising structured, unstructured and semi-structured data, are classified by business scenario. This gives the data governance device the capability to access, process and drive all of these data types and, through the cooperation of multiple engines, to rapidly fuse the accessed structured, unstructured and semi-structured data.

Description

Scientific and technological big data intelligent decision analysis method
Technical Field
The invention belongs to the technical field of decision analysis methods, and particularly relates to a scientific and technological big data intelligent decision analysis method.
Background
At present, cloud computing platforms, also called cloud platforms, fall into three categories: storage-oriented platforms centered on data storage, computing-oriented platforms centered on data processing, and comprehensive platforms that balance computing with data storage and processing. All of them use big data processing capabilities to satisfy users' varied requests and provide services.
The cloud computing service platforms on the market are numerous and can basically meet users' needs, but certain defects remain.
Through the above analysis, the problems and defects of the prior art are as follows:
(1) Defects in unified operation and maintenance: scientific and technological government-affairs big data is a systematic undertaking involving large-scale system construction and continuous operation and maintenance. As government applications diversify, their scale grows and their architecture becomes more complex, so increasingly meticulous operation and maintenance work is needed to keep systems running safely and stably. After a traditional government system is built, operation and maintenance is usually undertaken by the construction vendors; the management work is complicated and inefficient, the mechanisms are unsound, and buck-passing when problems occur is a serious phenomenon.
(2) Defects in data interoperability: current government big data construction still faces the hard problem of information sharing, and information silos remain widespread. The past project-by-project construction model left data scattered across departments and bound by administrative and line-of-business divisions, while the data of vertically managed departments is concentrated in national and provincial data centers and requires coordination with superior authorities. Some units treat data as private property and, citing security, provide it sparsely, slowly, or not at all; some units' data quality is low and, since "whoever provides the data is responsible for it", they dare not provide it, so out of self-interest and risk avoidance the committees and offices are unwilling to provide data proactively. Because data passes through multiple transfers between provider and user, its timeliness, freshness and authority cannot be guaranteed.
(3) Defects in data standardization: through years of building various government systems, government departments have accumulated a large amount of data, but data application, data services and data asset management remain essentially blank. Much of the data sits in electronic documents such as Excel and Word files, and some exists only on paper. Such data cannot be used directly by programs, and inter-departmental data calls often require repeated collection and repeated entry, seriously wasting manpower and material resources. Before a business system can use the data, it must be collected, processed, cleaned and governed uniformly, and only after a series of standardization steps can it be used; even then, the governed data still cannot be shared because data formats and structures differ.
(4) Defects in rapid application building: traditional application system development spans the front end, back end, database, middleware and server, and requires large numbers of developers writing code. It generally suffers from high personnel cost, low development efficiency, weak extensibility and high operation and maintenance cost, and basically cannot respond to customer requirements in time.
(5) Defects in the standards system: in government big data construction, every vendor has its own architectural specifications and design standards. Vendors differ from one another in technical standards, architecture standards, data standards, platform and tool standards, and data security standards. This creates enormous hidden dangers for later system construction in areas such as data interfacing, interface invocation, tool standards and security construction; problems of "cannot match, cannot connect, cannot use well" are ubiquitous and are an indirect cause of information silos.
Disclosure of Invention
In view of the above technical problems, the invention provides a scientific and technological big data intelligent decision analysis method that is efficient, widely applicable and low in error.
To solve these technical problems, the invention adopts the following technical scheme:
a scientific and technological big data intelligent decision analysis method comprises the following steps:
s1, collecting data of an integral platform;
s2, managing the collected data of the whole platform;
and S3, integrating the data.
The method for collecting the data of the whole platform in S1 is as follows: data are collected from the intermediate library or the front-end processor database, data packets are generated according to a defined data exchange schema, and the packets are sent to a receiver over a customized sending route; on receiving a packet, the receiver automatically unpacks it and stores the data in its own front-end processor exchange database. The collected content mainly comprises full collection and incremental collection; the collection modes mainly comprise trigger mode, timestamp mode, full-table delete-and-insert mode, and full-table comparison mode; the collection tools mainly comprise FTP file transfer, a data import tool, and a WebService interface. Through data standard establishment, data collection, data processing and data loading, the collected data, which include structured, unstructured and semi-structured data, are classified by business scenario and stored in a raw library.
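As an illustration of the timestamp collection mode named above, the following is a minimal Python sketch that pulls only rows changed since the last recorded watermark; every table and column name here (src_table, exchange_table, updated_at, watermark) is a hypothetical stand-in, not a name from the patent.

```python
import sqlite3

# A minimal sketch of timestamp-mode incremental collection, assuming a
# source table "src_table" with an "updated_at" column; all names are
# illustrative assumptions.
def incremental_collect(src: sqlite3.Connection, dst: sqlite3.Connection) -> int:
    dst.execute("CREATE TABLE IF NOT EXISTS watermark (ts TEXT)")
    dst.execute(
        "CREATE TABLE IF NOT EXISTS exchange_table "
        "(id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
    )
    # Full collection is the degenerate case: an empty watermark pulls every row.
    last_ts = dst.execute("SELECT MAX(ts) FROM watermark").fetchone()[0] or ""
    rows = src.execute(
        "SELECT id, payload, updated_at FROM src_table WHERE updated_at > ?",
        (last_ts,),
    ).fetchall()
    if rows:
        dst.executemany(
            "INSERT OR REPLACE INTO exchange_table VALUES (?, ?, ?)", rows
        )
        # Advance the watermark to the newest source timestamp just collected.
        dst.execute("INSERT INTO watermark VALUES (?)", (max(r[2] for r in rows),))
    dst.commit()
    return len(rows)
```

A trigger-mode collector would instead read rows queued by database triggers, and a full-table comparison mode would diff the entire source table against the target.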
The method for governing the data collected by the whole platform in S2 is as follows: source data are acquired by the data acquisition device; big data processing technology is used to govern and optimize the dirty, duplicate and erroneous data in the raw library; and after governance is complete, big data algorithm technology is used to standardize the data, thereby achieving rapid data governance.
The method for governing and optimizing dirty, duplicate and erroneous data in the raw library using big data processing technology is as follows: business data that do not meet the standard specification or are invalid are filtered out, and different data cleaning processes, including data filtering and data deduplication, are configured according to the different sources of the collected data.
The data filtering method is as follows: select data filtering, choose the relevant fields, configure the corresponding filtering rules, and mark the results by labeling. The rules comprise: standard rules and filtering for mobile phone numbers; standard rules and filtering for unified social credit codes; standard rules and filtering for ID card numbers; and standard rules and filtering for names.
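The field-level rules named above can be expressed as validation patterns. The sketch below is an assumed implementation using common Chinese-format checks for mobile phone numbers, unified social credit codes and ID card numbers; the patent's own rule table (Table 1 of the description) is published only as an image, so these patterns are assumptions.

```python
import re

# Common Chinese-format validation patterns; assumptions, not the patent's
# own Table 1 rules.
RULES = {
    "mobile_phone": re.compile(r"1[3-9]\d{9}"),
    "social_credit_code": re.compile(r"[0-9A-HJ-NPQRTUWXY]{2}\d{6}[0-9A-HJ-NPQRTUWXY]{10}"),
    "id_card": re.compile(r"\d{17}[\dXx]"),
}

def filter_record(record: dict) -> dict:
    """Label each configured field 'pass' or 'fail' against its rule."""
    return {
        field: "pass" if rule.fullmatch(str(record.get(field, ""))) else "fail"
        for field, rule in RULES.items()
    }

print(filter_record({
    "mobile_phone": "13812345678",
    "social_credit_code": "91350100M000100Y43",
    "id_card": "11010119900101123X",
}))  # -> {'mobile_phone': 'pass', 'social_credit_code': 'pass', 'id_card': 'pass'}
```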
The data deduplication method is as follows: duplicate data are removed to ensure uniqueness. According to the relevant business attributes, the fields that order data updates are determined through a timestamp or an auto-increment primary key, deduplication is configured, and the corresponding fields are selected as the deduplication conditions. The ordering rules are: auto-increment primary key order; data update timestamp order; business serial number order; or any other order that can determine which of the duplicates is the latest update. The deduplication basis is then judged: for each field, decide whether duplicate values are allowed; where they are not, deduplication is required.
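A minimal sketch of this deduplication in pandas, assuming a hypothetical business key record_id and an updated_at timestamp as the field that orders data updates:

```python
import pandas as pd

# Hypothetical records: record_id is the deduplication condition and
# updated_at orders the updates.
df = pd.DataFrame({
    "record_id":  [1, 1, 2],
    "payload":    ["old", "new", "only"],
    "updated_at": ["2022-01-01", "2022-06-01", "2022-03-01"],
})

# Keep only the latest row per business key, per the timestamp ordering rule;
# an auto-increment key or business serial number could order rows instead.
deduped = (
    df.sort_values("updated_at")
      .drop_duplicates(subset=["record_id"], keep="last")
)
print(deduped)  # record_id 1 retains only the "new" payload
```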
The method for standardizing the data using big data algorithm technology comprises the following steps:
S2.1, in the standardization stage, in combination with data standard management, the defined data elements, data dictionaries, relevant standard specifications and business processes are sorted out and maintained in advance in the metadata management module, and the data content, scope of use, mode of use, quality, update mode, update cycle, data source, responsible unit, data provider, storage location, processing logic, data format and other attributes are managed uniformly, so as to govern the converged and accessed data resources;
S2.2, the module interfaces with the standard data element library, and standard data elements can be synchronized into the local data element library manually or by automatic system import, following the standard data element synchronization and update mechanism; the local data element library holds both local data elements and standard data elements;
S2.3, standard verification of the local data elements is performed against the data element standard verification rules, that is, every real data field of a local data element is compared with the standard data element, and a result report is formed.
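A minimal sketch of the S2.3 comparison: every real field of a local data element is checked against a standard data element and a result report is emitted. The element definitions below are hypothetical, not taken from the patent.

```python
# Hypothetical standard data element: field name -> expected type spec.
STANDARD_ELEMENT = {"name": "str", "id_card": "str", "birth_date": "str"}

def verify_element(local_fields: dict) -> list:
    """Compare a local data element's real fields with the standard element."""
    report = []
    for field, spec in STANDARD_ELEMENT.items():
        if field not in local_fields:
            report.append(f"MISSING: standard field '{field}' absent locally")
        elif local_fields[field] != spec:
            report.append(f"MISMATCH: '{field}' is {local_fields[field]}, standard is {spec}")
    for field in local_fields:
        if field not in STANDARD_ELEMENT:
            report.append(f"NON-STANDARD: local field '{field}' not in standard element")
    return report

print(verify_element({"name": "str", "id_card": "int", "gender": "str"}))
# reports one type mismatch, one missing field, and one non-standard field
```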
The method for integrating the data in S3 comprises the following steps:
S3.1, data cleaning and conversion: the cleaning process mainly filters business data that do not meet the standard specifications or are invalid, with different cleaning applied according to the different sources of the collected data; cleaning rules are defined before data integration, and an error level is set for data that match a cleaning rule; when data matching a cleaning rule are encountered during integration, the system marks that business data as problem data and classifies it by error severity, and, combined with data quality management, perfects the data governance process, widens the scope of data analysis, and establishes a model for finding and analyzing abnormal data (a minimal rule-engine sketch follows these steps);
S3.2, data loading: the integrated data are inserted into the corresponding data tables according to the big data standard library structure; the main operations performed during loading are insertion and modification; loading is performed after the system is updated, and if the data in the database come from several interrelated operational systems, data movement must be guaranteed while those systems work in synchrony;
S3.3, data verification: the verification process operates on the integrated business data, verifying the data of the constructed standard library by means of a data quality analysis tool.
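The following is the rule-engine sketch referenced in S3.1: records matching a cleaning rule are marked as problem data and classified by error severity. The specific rules and severity levels shown are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

# Assumed severity classification for problem data.
class Severity(Enum):
    WARNING = 1
    ERROR = 2
    FATAL = 3

@dataclass
class CleaningRule:
    name: str
    severity: Severity
    matches: Callable[[dict], bool]  # True when the record violates the rule

# Illustrative cleaning rules; real rules come from data standard management.
RULES = [
    CleaningRule("missing primary id", Severity.FATAL, lambda r: not r.get("id")),
    CleaningRule("empty name field", Severity.ERROR, lambda r: not r.get("name")),
]

def classify_problems(record: dict) -> list:
    """Return the (rule, severity) pairs this record violates."""
    return [(rule.name, rule.severity) for rule in RULES if rule.matches(record)]

print(classify_problems({"id": "", "name": "Zhang"}))
# -> [('missing primary id', <Severity.FATAL: 3>)]
```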
The data loading in S3.2 comprises the following modes: basic loading, additional loading, destructive merging and constructive merging;
basic loading: the converted data are written into the loading target table; if the target table already contains data, that data is deleted first and the new data loaded;
additional loading: if the target table already contains data, new data are added while the existing data are preserved; when an incoming record duplicates an existing record, the incoming record may either be added as a duplicate copy or be discarded;
destructive merging: if the primary key of an incoming record matches the primary key of an existing record, the target record is updated with the new input data; if the incoming record is new and matches no existing record, it is added to the target table;
constructive merging: if the primary key of an incoming record matches the primary key of an existing record, the existing record is kept, the incoming record is added, and it is marked as superseding the old record.
The data verification in S3.3 comprises the following steps:
S3.3.1, when data are collected from a business system database into the big data process library, the corresponding data standard verification is performed; for real-time data verification, log storage and query statistics over the key information and uniqueness information of the accessed data must be supported;
S3.3.2, for accessed data with high reliability and consistency requirements, a data synchronization check and verification function is provided and a detailed log is output;
S3.3.3, statistics of the access state within a given time range are supported, including statistics on data updates, communication state, data volume and verification results.
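A minimal sketch of the S3.3.3 access-state statistics over a time window; the in-memory log and its fields stand in for the detailed logs of S3.3.2 and are assumptions.

```python
# Illustrative access log entries: timestamp, row count, verification result,
# and communication-link state.
LOG = [
    {"ts": "2022-11-17T08:00", "rows": 120, "verified": True,  "link": "up"},
    {"ts": "2022-11-17T09:30", "rows": 40,  "verified": False, "link": "up"},
    {"ts": "2022-11-18T10:00", "rows": 75,  "verified": True,  "link": "down"},
]

def access_stats(start: str, end: str) -> dict:
    """Summarize access state within [start, end] (ISO timestamps compare lexically)."""
    window = [e for e in LOG if start <= e["ts"] <= end]
    return {
        "updates": len(window),                             # data updates
        "rows": sum(e["rows"] for e in window),             # data volume
        "verified_ok": sum(e["verified"] for e in window),  # verification results
        "links_up": sum(e["link"] == "up" for e in window), # communication state
    }

print(access_stats("2022-11-17T00:00", "2022-11-17T23:59"))
# -> {'updates': 2, 'rows': 160, 'verified_ok': 1, 'links_up': 2}
```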
Compared with the prior art, the invention has the following beneficial effects:
1. The invention collects data from an intermediate library or a front-end processor database, generates data packets according to a defined data exchange schema, and sends them to the receiver over a customized sending route. On receiving a packet, the receiver automatically unpacks it and stores the data in its own front-end processor exchange database. The collected content mainly comprises full collection and incremental collection; the collection modes mainly comprise trigger mode, timestamp mode, full-table delete-and-insert mode, and full-table comparison mode. The collected data, comprising structured, unstructured and semi-structured data, are classified by business scenario. This gives the data governance device the capability to access, process and drive all data types and, through the cooperation of multiple engines, to rapidly fuse the accessed structured, unstructured and semi-structured data.
2. The invention establishes a standard data governance system and a data application system, effectively combining processes, strategies and standards through a union of business and technology, achieving effective management of platform data, and mining and raising the value of that data. The data governance system takes source data and data assets as its core objects of management and builds a data processing framework and data management mechanism, supported and standardized by a data governance workflow and a data governance team, that offer application systems a standard, reliable and timely external data service capability. The data application system integrates data assets and data services, creates typical data application scenarios, and provides comprehensive support across application fields.
3. The invention accesses the standardized data output by the data governance device, performs multi-dimensional statistical analysis on it, and finally forms an analysis report. No code development is needed in use; analyses are built simply by drag-and-drop, making the method simple, easy to use, and readily extensible.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings in the following description are merely exemplary, and those of ordinary skill in the art can derive other embodiments from them without inventive effort.
The structures, proportions, sizes and the like shown in this specification are provided only to accompany the disclosed content for the understanding of those skilled in the art; they do not limit the conditions under which the invention can be implemented, and any structural modification, change of proportion, or adjustment of size that does not affect the functions and purposes of the invention still falls within the scope of the invention.
FIG. 1 is a block diagram of the data acquisition process of the present invention;
FIG. 2 is a block diagram of the data governance process of the present invention;
FIG. 3 is a block diagram of the data integration process of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments are described below. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them, and the description serves to further explain the features and advantages of the invention rather than to limit its claims; all other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present application.
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
In this embodiment, as shown in FIGS. 1-3, the method comprises the following steps.
Step 1: collect the data of the whole platform. Data are collected from the intermediate library or the front-end processor database, data packets are generated according to the defined data exchange schema, and the packets are sent to the receiver over a customized sending route; on receiving a packet, the receiver automatically unpacks it and stores the data in its own front-end processor exchange database. The collected content mainly comprises full collection and incremental collection; the collection modes mainly comprise trigger mode, timestamp mode, full-table delete-and-insert mode, and full-table comparison mode; the collection tools mainly comprise FTP file transfer, a data import tool, and a WebService interface. Through data standard establishment, data collection, data processing and data loading, the collected data, which include structured, unstructured and semi-structured data, are classified by business scenario and stored in the raw library.
Step 2: govern the collected platform data. Source data are acquired by the data acquisition device; big data processing technology is used to govern and optimize the dirty, duplicate and erroneous data in the raw library; after governance, big data algorithm technology is used to standardize the data, achieving rapid data governance.
The method for governing and optimizing dirty, duplicate and erroneous data in the raw library using big data processing technology is as follows: business data that do not meet the standard specification or are invalid are filtered out, and different data cleaning processes, including data filtering and data deduplication, are configured according to the different sources of the collected data.
The data filtering method is as follows: select data filtering, choose the relevant fields, configure the corresponding filtering rules, and mark the results by labeling. The rules comprise: standard rules and filtering for mobile phone numbers; standard rules and filtering for unified social credit codes; standard rules and filtering for ID card numbers; and standard rules and filtering for names. The specific data filtering rules are shown in Table 1.
Table 1: data filtering rules (published as an image in the original document)
The data deduplication method is as follows: duplicate data are removed to ensure uniqueness. According to the relevant business attributes, the fields that order data updates are determined through a timestamp or an auto-increment primary key, deduplication is configured, and the corresponding fields are selected as the deduplication conditions. The ordering rules are: auto-increment primary key order; data update timestamp order; business serial number order; or any other order that can determine which of the duplicates is the latest update. The deduplication basis is then judged: for each field, decide whether duplicate values are allowed; where they are not, deduplication is required.
The data are standardized using big data algorithm technology as follows.
In the standardization stage, in combination with data standard management, the defined data elements, data dictionaries, relevant standard specifications and business processes are sorted out and maintained in advance in the metadata management module, and the data content, scope of use, mode of use, quality, update mode, update cycle, data source, responsible unit, data provider, storage location, processing logic, data format and other attributes are managed uniformly, so as to govern the converged and accessed data resources.
The module interfaces with the standard data element library, and standard data elements are synchronized into the local data element library manually or by automatic system import, following the standard data element synchronization and update mechanism. The local data element library holds both local data elements and standard data elements.
Standard verification of the local data elements is performed against the data element standard verification rules, that is, every real data field of a local data element is compared with the standard data element, and a result report is formed.
Step 3: integrate the data.
Data cleaning and conversion: the cleaning process mainly filters business data that do not meet the standard specifications or are invalid, with different cleaning applied according to the different sources of the collected data. Cleaning rules are defined before data integration, and an error level is set for data that match a cleaning rule. When data matching a cleaning rule are encountered during integration, the system marks that business data as problem data and classifies it by error severity; combined with data quality management, this perfects the data governance process, widens the scope of data analysis, and establishes a model for finding and analyzing abnormal data.
The data conversion process includes data format conversion, code conversion, value conversion and other conversion modes, meeting the data conversion requirements of the client data center. Data sources are diverse, and a unified data format or code specification was not considered when the source databases were first built; to guarantee the regularity and consistency of the standard library data, corresponding conversions must be applied during integration. The specific conversion modes are shown in Table 2, and a code sketch follows the table. For example, unifying a person's date of birth into an eight-digit character date is a format conversion; unifying a person's gender data into the national standard gender code is a code conversion; and unifying a person's identity card number into the 18-digit form is a value conversion.
Table 2: data conversion rules (published as an image in the original document)
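A minimal sketch of the three conversion types exemplified above; the function names are illustrative assumptions. The gender mapping follows the GB/T 2261.1 national standard (1 = male, 2 = female), and the 15-to-18-digit ID conversion uses the published GB 11643 check-digit scheme.

```python
from datetime import datetime

def format_birth_date(raw: str) -> str:
    """Format conversion: unify a birth date into an eight-digit string."""
    return datetime.strptime(raw, "%Y-%m-%d").strftime("%Y%m%d")

GENDER_CODES = {"男": "1", "女": "2"}  # GB/T 2261.1 codes

def convert_gender(text: str) -> str:
    """Code conversion: map gender text to the national standard code."""
    return GENDER_CODES[text]

def id15_to_id18(id15: str) -> str:
    """Value conversion: expand a 15-digit ID card number to 18 digits."""
    body = id15[:6] + "19" + id15[6:]  # insert the century into the birth date
    weights = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
    check = "10X98765432"[sum(int(d) * w for d, w in zip(body, weights)) % 11]
    return body + check

print(format_birth_date("1990-01-01"))   # 19900101
print(convert_gender("女"))               # 2
print(id15_to_id18("110105491231002"))   # 18-digit form with check digit
```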
Data loading: the integrated data are inserted into the corresponding data tables according to the big data standard library structure; the main operations performed during loading are insertion and modification. Loading is performed after the system is updated; since the data in the database come from several interrelated operational systems, data movement must be guaranteed while those systems work in synchrony.
The data loading modes are: basic loading, additional loading, destructive merging and constructive merging.
Basic loading: the converted data are written into the loading target table; if the target table already contains data, that data is deleted first and the new data loaded.
Additional loading: if the target table already contains data, new data are added while the existing data are preserved; when an incoming record duplicates an existing record, the incoming record may either be added as a duplicate copy or be discarded.
Destructive merging: if the primary key of an incoming record matches the primary key of an existing record, the target record is updated with the new input data; if the incoming record is new and matches no existing record, it is added to the target table.
Constructive merging: if the primary key of an incoming record matches the primary key of an existing record, the existing record is kept, the incoming record is added, and it is marked as superseding the old record.
Data verification: the verification process operates on the integrated business data, verifying the data of the constructed standard library by means of a data quality analysis tool.
The data verification method is as follows.
When data are collected from a business system database into the big data process library, the corresponding data standard verification is performed; for real-time data verification, log storage and query statistics over the key information and uniqueness information of the accessed data must be supported.
For accessed data with high reliability and consistency requirements, a data synchronization check and verification function is provided and a detailed log is output.
Statistics of the access state within a given time range are supported, including statistics on data updates, communication state, data volume and verification results.
Although only preferred embodiments of the invention have been described in detail, the invention is not limited to the above embodiments; various changes may be made within the knowledge of those skilled in the art without departing from the spirit of the invention, and all such changes fall within its scope.

Claims (10)

1. A scientific and technological big data intelligent decision analysis method, characterized by comprising the following steps:
S1, collecting data from the whole platform;
S2, governing the collected platform data;
and S3, integrating the data.
2. The scientific and technological big data intelligent decision analysis method according to claim 1, characterized in that the method for collecting the data of the whole platform in S1 is as follows: data are collected from the intermediate library or the front-end processor database, data packets are generated according to a defined data exchange schema, and the packets are sent to a receiver over a customized sending route; on receiving a packet, the receiver automatically unpacks it and stores the data in its own front-end processor exchange database; the collected content mainly comprises full collection and incremental collection; the collection modes mainly comprise trigger mode, timestamp mode, full-table delete-and-insert mode, and full-table comparison mode; the collection tools mainly comprise FTP file transfer, a data import tool, and a WebService interface; through data standard establishment, data collection, data processing and data loading, the collected data, which include structured, unstructured and semi-structured data, are classified by business scenario and stored in a raw library.
3. The scientific and technological big data intelligent decision analysis method according to claim 1, characterized in that the method for governing the data collected by the whole platform in S2 is as follows: source data are acquired by the data acquisition device; big data processing technology is used to govern and optimize the dirty, duplicate and erroneous data in the raw library; and after governance, big data algorithm technology is used to standardize the data, thereby achieving rapid data governance.
4. The scientific and technological big data intelligent decision analysis method according to claim 3, characterized in that the method for governing and optimizing dirty, duplicate and erroneous data in the raw library using big data processing technology is as follows: business data that do not meet the standard specification or are invalid are filtered out, and different data cleaning processes, including data filtering and data deduplication, are configured according to the different sources of the collected data.
5. The scientific and technological big data intelligent decision analysis method according to claim 4, characterized in that the data filtering method is as follows: select data filtering, choose the relevant fields, configure the corresponding filtering rules, and mark the results by labeling; the rules comprise: standard rules and filtering for mobile phone numbers; standard rules and filtering for unified social credit codes; standard rules and filtering for ID card numbers; and standard rules and filtering for names.
6. The scientific and technological big data intelligent decision analysis method according to claim 4, characterized in that the data deduplication method is as follows: duplicate data are removed to ensure uniqueness; according to the relevant business attributes, the fields that order data updates are determined through a timestamp or an auto-increment primary key, deduplication is configured, and the corresponding fields are selected as the deduplication conditions; the ordering rules are: auto-increment primary key order; data update timestamp order; business serial number order; or any other order that can determine which of the duplicates is the latest update; the deduplication basis is then judged: for each field, decide whether duplicate values are allowed, and where they are not, deduplication is required.
7. The scientific and technological big data intelligent decision analysis method according to claim 3, characterized in that the method for standardizing the data using big data algorithm technology comprises the following steps:
S2.1, in the standardization stage, in combination with data standard management, the defined data elements, data dictionaries, relevant standard specifications and business processes are sorted out and maintained in advance in the metadata management module, and the data content, scope of use, mode of use, quality, update mode, update cycle, data source, responsible unit, data provider, storage location, processing logic, data format and other attributes are managed uniformly, so as to govern the converged and accessed data resources;
S2.2, the module interfaces with the standard data element library, and standard data elements are synchronized into the local data element library manually or by automatic system import, following the standard data element synchronization and update mechanism; the local data element library holds both local data elements and standard data elements;
S2.3, standard verification of the local data elements is performed against the data element standard verification rules, that is, every real data field of a local data element is compared with the standard data element, and a result report is formed.
8. The scientific and technological big data intelligent decision analysis method according to claim 1, characterized in that the method for integrating the data in S3 comprises the following steps:
S3.1, data cleaning and conversion: the cleaning process mainly filters business data that do not meet the standard specifications or are invalid, with different cleaning applied according to the different sources of the collected data; cleaning rules are defined before data integration, and an error level is set for data that match a cleaning rule; when data matching a cleaning rule are encountered during integration, the system marks that business data as problem data and classifies it by error severity, and, combined with data quality management, perfects the data governance process, widens the scope of data analysis, and establishes a model for finding and analyzing abnormal data;
S3.2, data loading: the integrated data are inserted into the corresponding data tables according to the big data standard library structure; the main operations performed during loading are insertion and modification; loading is performed after the system is updated, and if the data in the database come from several interrelated operational systems, data movement must be guaranteed while those systems work in synchrony;
S3.3, data verification: the verification process operates on the integrated business data, verifying the data of the constructed standard library by means of a data quality analysis tool.
9. The scientific and technological big data intelligent decision analysis method according to claim 8, characterized in that the data loading in S3.2 comprises the following modes: basic loading, additional loading, destructive merging and constructive merging;
basic loading: the converted data are written into the loading target table; if the target table already contains data, that data is deleted first and the new data loaded;
additional loading: if the target table already contains data, new data are added while the existing data are preserved; when an incoming record duplicates an existing record, the incoming record may either be added as a duplicate copy or be discarded;
destructive merging: if the primary key of an incoming record matches the primary key of an existing record, the target record is updated with the new input data; if the incoming record is new and matches no existing record, it is added to the target table;
constructive merging: if the primary key of an incoming record matches the primary key of an existing record, the existing record is kept, the incoming record is added, and it is marked as superseding the old record.
10. The scientific and technological big data intelligent decision analysis method according to claim 8, characterized in that the data verification in S3.3 comprises the following steps:
S3.3.1, when data are collected from a business system database into the big data process library, the corresponding data standard verification is performed; for real-time data verification, log storage and query statistics over the key information and uniqueness information of the accessed data must be supported;
S3.3.2, for accessed data with high reliability and consistency requirements, a data synchronization check and verification function is provided and a detailed log is output;
S3.3.3, statistics of the access state within a given time range are supported, including statistics on data updates, communication state, data volume and verification results.
CN202211460302.8A 2022-11-17 2022-11-17 Scientific and technological big data intelligent decision analysis method Pending CN115965329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211460302.8A CN115965329A (en) 2022-11-17 2022-11-17 Scientific and technological big data intelligent decision analysis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211460302.8A CN115965329A (en) 2022-11-17 2022-11-17 Scientific and technological big data intelligent decision analysis method

Publications (1)

Publication Number Publication Date
CN115965329A 2023-04-14

Family

ID=87362382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211460302.8A Pending CN115965329A (en) 2022-11-17 2022-11-17 Scientific and technological big data intelligent decision analysis method

Country Status (1)

Country Link
CN (1) CN115965329A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination