CN104516912A - Dynamic data storage method and device - Google Patents


Info

Publication number
CN104516912A
Authority
CN
China
Prior art keywords
data
row
key
storage
attribute column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310459768.0A
Other languages
Chinese (zh)
Other versions
CN104516912B (en)
Inventor
苏燕
赵洪松
关德军
李振嘉
段云峰
李红燕
张美鸥
王依兴
孙德志
迟建德
李宏昌
王雅文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Heilongjiang Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Group Heilongjiang Co Ltd
Priority to CN201310459768.0A
Publication of CN104516912A
Application granted
Publication of CN104516912B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22: Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a dynamic data storage method. The method comprises: performing an overall metadata definition on the data to be stored, which defines the storage policy of every attribute column of that data; organizing the attribute columns into different data subsets according to a key-value pair model; and, according to the storage policies of the attribute columns, defining a physical storage format for each data subset and storing the data subsets in those formats. An embodiment of the invention also provides a corresponding dynamic data storage device. The method is realized through a layered, configurable storage structure and can simultaneously meet the storage requirements of sparse data sets and dense data sets in massive data processing.

Description

Dynamic data storage method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a dynamic data storage method and device.
Background
With the rapid development of the Internet, the Internet of Things and the ICT industry, the associated business data is growing explosively; according to an IDC report, global data volume will grow at a rate of 40+% over the next ten years. For the Business Analysis Support System of China Mobile Communications Corporation (hereinafter the "business analysis system"), the introduction of emerging service data such as Internet data has sharply increased the difficulty of data storage and the pressure on system support. As the core system for business analysis, the support quality of the business analysis system directly affects management decisions and bears on the operation and development of the company.
With the explosive growth of data volume in the business analysis system, how to partition all kinds of business data reasonably and store them efficiently, so as to build a high-quality, low-cost support system, is a problem that urgently needs solving. The data storage methods commonly used at present include the following two:
Row storage: the storage technology of the data warehouse, hosted on Oracle and mainly holding structured data, such as product, bill, service and detail-record business data. The main advantages of row storage are that data is stored directly by tuple, so write efficiency is high; data integrity and reliability are high; and it is suitable for OLTP (On-Line Transaction Processing) scenarios.
Column storage: the storage technology of the Internet data platform, hosted on Hadoop HBase and mainly holding structured and semi-structured data, such as web page data and WAP detail-record data. The main advantages of column storage are that no redundant columns are read, so the data hit rate is high; the data type of each column (file) is homogeneous and unambiguous, so data parsing is efficient; a higher compression ratio can be obtained, with a more pronounced effect on sparse data; schema changes are cheap, and attribute columns can in theory be extended without limit; and it is suitable for OLAP (On-Line Analytical Processing) scenarios.
Row storage and column storage are the mainstream storage schemes currently adopted by the business analysis system. However, with the introduction of "emerging service data" from the Internet of Things, the Internet and similar sources, the bottlenecks of this approach are gradually becoming apparent, mainly in the following respects:
The storage scheme is single and has technical weaknesses; storage space is seriously wasted; the storage scheme for some business data is not reasonable enough and lacks flexibility; mixed, efficient storage of sparse data and dense data cannot be achieved; and the data quality of some services is poor, with "missing values" handled imprecisely.
Summary of the invention
To solve the above problems of data storage in the prior art, embodiments of the present invention propose a dynamic data storage method and device.
One aspect of the present invention provides a dynamic data storage method, comprising:
performing an overall metadata definition on the data to be stored, defining the storage policy of each attribute column in the data to be stored;
organizing the attribute columns into different data subsets according to a key-value pair model;
defining a physical storage format for each data subset according to the storage policies of the attribute columns, and storing the data subsets in those physical storage formats.
Another aspect of the present invention provides a dynamic data storage device, comprising:
a definition unit, for performing an overall metadata definition on the data to be stored and defining the storage policy of each attribute column in the data to be stored;
an organization unit, for organizing the attribute columns into different data subsets according to a key-value pair model;
a storage unit, for defining a physical storage format for each data subset according to the storage policies of the attribute columns, and storing the data subsets in those physical storage formats.
With the dynamic data storage method and device disclosed in the embodiments of the present invention, dynamic data storage is realized through a layered, configurable storage structure that can simultaneously meet the storage requirements of "sparse data sets and dense data sets" in big data processing. This solves the current "single storage scheme" problem of the business analysis system, provides a flexible storage scheme for mass data from the Internet of Things, the Internet and similar sources, and can effectively support data storage and analysis under the "new business model";
The RCFile storage method (horizontal partitioning first, then vertical partitioning) has a complex internal structure and a high schema-change cost, and is relatively well suited to "read-only" data warehouses. The present invention instead defines the storage structure in layers and separates keys from values; the structure is simple and loosely coupled, and schema changes can be made quickly (for example, "adding or deleting a column" only requires adding or deleting a file in the configuration of the "tabular soft schema layer");
RCFile processes "all attribute columns of an entity" according to a single principle (horizontal partitioning first, then vertical partitioning), which is not flexible enough. The present invention can choose row storage or column storage specifically according to the "data characteristics" of an entity's attribute columns: for example, column storage is chosen for an entity's sparse data columns to save storage, and row storage is chosen for dense data columns to achieve efficient writes;
To address the large amount of missing data that is common in Internet of Things, Internet and similar services, the present invention defines types for missing values, providing an effective technical means of improving the "analysis quality" of big data;
The technical implementation of the invention sits in the data layer of the "information service domain" of the business analysis system and exchanges data with upper-layer business applications through a "transparent access layer". Business coupling is low and the processing logic is simple and clear, fully meeting the mass data storage and access requirements of the business analysis data warehouse.
Brief description of the drawings
Fig. 1 is a schematic diagram of the dynamic data storage method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the data storage structure hierarchy provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of a structured key-value model provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the "multi-version" data storage structure of a data object provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the overall implementation logic of mixed row-column storage provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the overall implementation logic of row storage provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the overall implementation logic of column storage provided by an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of the dynamic data storage device provided by an embodiment of the present invention.
Detailed description of the embodiments
From the analysis of the prior art, there is a strong practical demand for a "hybrid storage structure" that supports column storage, with its good read performance, as well as row storage, with its good write performance, and therefore supports OLTP and OLAP applications at the same time. Through a mixed row-column storage layout, the present invention provides a physical storage format that can be mixed and matched. An entity can be stored entirely by row or entirely by column; in addition, within a single entity, a suitable row or column storage format can be selected for each field according to data characteristics such as sparsity and density. Dynamic data storage is thereby achieved, providing a dynamic data storage method for big data processing. The method adapts to the characteristics of traditional dense data while also accommodating the storage requirements of massive sparse data, solving the waste of storage space caused by massive sparse data in the business analysis system.
Application examples of the typical storage scenarios that the present invention primarily supports:
Row storage: to obtain better write performance, entities with higher data integrity and reliability requirements can be physically organized in row storage format, for example the user table, bill table and subscription relationship table of the business analysis system;
Column storage: to obtain better statistical analysis performance, entities with large data volumes, frequent schema changes and many missing values can be physically organized in column storage format, for example the WAP gateway table of the business analysis system;
Mixed row-column storage:
Within a single entity, a suitable row or column storage format can be selected for each field according to data characteristics such as sparsity and density, achieving mixed row-column storage. For example, in the voice signaling table, dense data columns such as the calling number, IMSI and start time are "stored by row", while sparse data columns such as the originating signaling point and CGI code are "stored by column".
The dynamic data storage method provided by the embodiments of the present invention is described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of the dynamic data storage method provided by an embodiment of the present invention. Each step is described in detail below with reference to this figure:
Step 101: perform an overall metadata definition on the data to be stored, defining the storage policy of each attribute column in the data to be stored;
Here, defining the storage policy of each attribute column in the data to be stored comprises:
building a tabular soft schema and adding a schema definition for each attribute column, so that the set of key-value pairs is viewed as an elastic container for data storage.
Defining the storage policy of each attribute column in the data to be stored further comprises:
if the data in the attribute column is dense data, adopting row storage;
if the data in the attribute column is sparse data, adopting column storage.
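The policy rule of step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not say how "dense" and "sparse" are measured, so the missing-value ratio and the 0.5 threshold are assumptions, and the function and constant names are hypothetical.

```python
from typing import Optional, Sequence

ROW_STORE = "row"        # dense attribute columns -> row storage
COLUMN_STORE = "column"  # sparse attribute columns -> column storage


def choose_storage_policy(values: Sequence[Optional[object]],
                          sparse_threshold: float = 0.5) -> str:
    """Pick a storage policy from the share of missing (None) values.

    The threshold is a hypothetical choice; the source text only states
    that dense data uses row storage and sparse data uses column storage.
    """
    if not values:
        return ROW_STORE
    missing_ratio = sum(v is None for v in values) / len(values)
    return COLUMN_STORE if missing_ratio >= sparse_threshold else ROW_STORE


# Sample columns: Col1 and Col2 are fully populated, Col3 is mostly missing.
columns = {
    "Col1": [1, 2, 3],
    "Col2": ["LiPing", "WangLin", "ZhangLi"],
    "Col3": [None, None, "2010/2/1"],
}
policies = {name: choose_storage_policy(vals) for name, vals in columns.items()}
```

In practice the threshold would presumably be part of the configurable metadata rather than a hard-coded constant.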
Step 102: organize the attribute columns into different data subsets according to a key-value pair model;
Here, organizing the attribute columns into different data subsets according to a key-value pair model comprises:
defining the key domain and the value domain of the key-value pair model separately, and storing the key domain and the value domain separately;
the key domain comprises at least one sub-key, and the schema definition of the attribute columns is kept in the key domain;
the value domain stores the data content corresponding to the key domain.
The sub-keys of the key domain include RK, CG and CA. The combination of the RK, CG and CA keys forms the query primary key, which uniquely identifies one data cell.
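The separation of key domain and value domain, with (RK, CG, CA) acting as the query primary key, might look like the sketch below. The in-memory dictionary and list stand in for the key domain file and the value domain file; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class QueryKey:
    rk: str  # RK: identifier of the data object (tuple)
    cg: str  # CG: attribute column group
    ca: str  # CA: address of the attribute column


# Key domain and value domain are kept in separate containers.
key_domain: Dict[QueryKey, int] = {}  # query key -> slot in the value domain
value_domain: List[str] = []          # data content only


def put(rk: str, cg: str, ca: str, value: str) -> None:
    """Store one data cell; the same (RK, CG, CA) always maps to one slot."""
    key = QueryKey(rk, cg, ca)
    if key in key_domain:
        value_domain[key_domain[key]] = value
    else:
        key_domain[key] = len(value_domain)
        value_domain.append(value)


def get(rk: str, cg: str, ca: str) -> str:
    """Look up a data cell through the key domain, then read the value domain."""
    return value_domain[key_domain[QueryKey(rk, cg, ca)]]


put("Row1", "ColumnGroup_R1", "Col2", "LiPing")
put("Row2", "ColumnGroup_R1", "Col2", "WangLin")
```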
Step 103: define a physical storage format for each data subset according to the storage policies of the attribute columns, and store the data subsets in those physical storage formats.
Here, defining a physical storage format for the data subsets according to the storage policies of the attribute columns comprises:
dividing the attribute columns into at least one partition and, for each partition:
if the partition is suited to row storage, mapping the partition to a single file, forming one column group;
if the partition is suited to column storage, storing at least one column of the partition in one file.
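A sketch of the partition-to-file mapping in step 103 follows. It assumes one file per column for column-store partitions, which is one permitted choice (the text only requires at least one column per file), and a hypothetical sequential file naming scheme.

```python
from typing import Dict, List, Sequence, Tuple


def plan_files(partitions: Sequence[Tuple[str, List[str]]],
               prefix: str = "DataFile") -> Dict[str, List[str]]:
    """Map partitions to data files.

    partitions: (policy, columns) pairs, policy being "row" or "column".
    A row-store partition becomes one file holding all of its columns;
    a column-store partition gets one file per column (an assumption).
    """
    layout: Dict[str, List[str]] = {}
    counter = 0
    for policy, cols in partitions:
        if policy == "row":
            counter += 1
            layout[f"{prefix}{counter}"] = list(cols)
        elif policy == "column":
            for col in cols:
                counter += 1
                layout[f"{prefix}{counter}"] = [col]
        else:
            raise ValueError(f"unknown storage policy: {policy!r}")
    return layout


layout = plan_files([
    ("row", ["Col1", "Col2"]),             # dense partition
    ("column", ["Col3", "Col4", "Col5"]),  # sparse partition
])
```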
The data content stored in the value domain also includes a timestamp, and the data content is ordered by timestamp.
The method further comprises: presetting the number of versions of the data content to be kept, and keeping that number of versions according to the timestamps of the data content; or,
presetting the time period for which data content is kept, and keeping the versions that fall within that period according to the timestamps.
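The two retention options can be sketched as below, assuming each version is a (timestamp in milliseconds, content) pair; the function names are hypothetical.

```python
from typing import List, Tuple

Version = Tuple[int, str]  # (timestamp in milliseconds, data content)


def keep_last_n(versions: List[Version], n: int) -> List[Version]:
    """Keep the n most recent versions, newest first."""
    return sorted(versions, key=lambda v: v[0], reverse=True)[:n]


def keep_within(versions: List[Version], now_ms: int,
                window_ms: int) -> List[Version]:
    """Keep the versions whose age is at most window_ms, newest first."""
    newest_first = sorted(versions, key=lambda v: v[0], reverse=True)
    return [v for v in newest_first if now_ms - v[0] <= window_ms]


history = [(1000, "a"), (3000, "c"), (2000, "b")]
by_count = keep_last_n(history, 2)
by_age = keep_within(history, now_ms=3500, window_ms=1600)
```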
In the above method, missing data in an attribute column is classified into two types: "missing and not applicable" and "missing but applicable".
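One way the two missing-value types could be represented is sketched below. The enum and the completeness calculation are illustrative assumptions: the text names the two types but does not describe how they are encoded or how analysis should use them.

```python
from enum import Enum
from typing import Optional, Sequence


class Missing(Enum):
    NOT_APPLICABLE = "missing_not_applicable"  # the attribute does not apply
    APPLICABLE = "missing_applicable"          # a value should exist but is absent


def completeness(cells: Sequence[object]) -> Optional[float]:
    """Share of applicable cells that hold a value.

    Cells marked NOT_APPLICABLE are left out of the denominator, so they
    do not count against data quality (a hypothetical use of the types).
    """
    applicable = [c for c in cells if c is not Missing.NOT_APPLICABLE]
    if not applicable:
        return None
    present = [c for c in applicable if c is not Missing.APPLICABLE]
    return len(present) / len(applicable)


ratio = completeness(["TRUE", Missing.APPLICABLE, Missing.NOT_APPLICABLE])
```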
The method provided by the embodiments of the present invention is described in detail below by way of concrete application examples.
The present invention realizes a dynamic data storage method through a layered, configurable storage structure that can simultaneously meet the mixed row and column storage requirements of sparse and dense data sets in big data processing. Fig. 2 is a schematic diagram of the data storage structure hierarchy provided by an embodiment of the present invention. The three basic steps of the data storage method are introduced with reference to this figure:
Step 1: tabular soft schema definition, equivalent to the definition of the overall metadata. Logically it takes the form of a tabular soft schema that globally defines the storage policy, data constraints and so on of each attribute column;
Step 2: key-value model definition based on Key-Value, equivalent to the key-value data definition layer. The various mixed data sets are represented as sets of (key, value) pairs, each (key, value) pair corresponding to one attribute value of an entity; the key-value model organizes the attribute columns into different data subsets;
Step 3: selectable physical storage format definition, equivalent to the physical storage layer definition for business data. Based on the metadata and the key-value model definition, a row-oriented or column-oriented physical storage format is defined for each data subset.
Each layer of the structure is described in detail below:
Layered, configurable storage structure
Tabular soft schema
The traditional "key-value model" offers fast lookup and high scalability and provides a layered way of describing data structure, but its "key-value pairs" contain only attribute values and lack a schema definition. On the basis of the traditional (key-value) model, the present invention builds a tabular soft schema that incorporates schema definition, viewing the "set of key-value pairs" as an elastic container for data storage and supporting schema definition and extension. It comprises the following main members; an application example is given in Table 1:
Table (Table): the logical storage container for data. A table consists of a standalone column and one or more column groups. The standalone RowKey column stores the key of each tuple; a column group is defined as a set of columns, and different column groups do not overlap;
Column group (ColumnGroup): the present invention provides two kinds of column group, ColumnGroup_R and ColumnGroup_C. In a ColumnGroup_R, data is stored in a row-oriented format; in a ColumnGroup_C, data is stored in a column-oriented format;
Column (Column): columns define the attribute domains of an entity without distinction. Except for the RowKey column, every column must belong to one specific column group;
Key domain file (KeyFile): the physical storage container for key-domain data. Different column groups of the same table can usually share a key domain file;
Value domain file (DataFile): the physical storage container for value-domain data. Different data files can be attached according to the characteristics of the rows and columns. For example, one data file can be defined for a ColumnGroup_R, while one or more data files, corresponding to the individual attribute columns, can be defined for a ColumnGroup_C;
Location type (PosType): defines how each attribute column is located, for example by fixed position or by delimiter;
Location information (PosValue): the detailed position of an attribute column under its "location type". For example, if the location type is "delimiter", the 5th field can be defined as the "name" column; if the location type is "fixed position", the 30th to 45th characters can be defined as the "address" column;
Data type (DataType): records the data type of each attribute column.
Table 1: Example of the main schema information in a tabular soft schema definition
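The schema members listed above can be sketched as a small column definition plus a locator that applies PosType and PosValue to a record. The field names, the 1-based counting and the inclusive character range are assumptions made for illustration; the sample records are invented.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass
class ColumnDef:
    name: str
    column_group: str                       # e.g. ColumnGroup_R1 or ColumnGroup_C1
    data_file: str                          # value domain file holding the column
    pos_type: str                           # "delimiter" or "fixed"
    pos_value: Union[int, Tuple[int, int]]  # field number, or (first, last) character
    data_type: str


def extract(record: str, col: ColumnDef, delimiter: str = ",") -> str:
    """Locate one attribute value inside a raw record."""
    if col.pos_type == "delimiter":
        return record.split(delimiter)[col.pos_value - 1]  # 1-based field number
    if col.pos_type == "fixed":
        first, last = col.pos_value
        return record[first - 1:last]                      # 1-based, inclusive
    raise ValueError(f"unknown location type: {col.pos_type!r}")


# "name" is the 5th delimited field; "address" is characters 30 to 45.
name_col = ColumnDef("name", "ColumnGroup_R1", "DataFile1", "delimiter", 5, "string")
addr_col = ColumnDef("address", "ColumnGroup_R1", "DataFile1", "fixed", (30, 45), "string")

name = extract("a,b,c,d,LiPing", name_col)
address = extract("x" * 29 + "Harbin".ljust(16) + "tail", addr_col)
```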
Structured key-value model
The key-value model proposes a structured (key-value) model definition method: the key domain is divided into multiple data cells, and a user-defined data structure is added to the value domain. Fig. 3 is a schematic diagram of a structured key-value model provided by an embodiment of the present invention; the model is introduced below with reference to this figure.
Key domain: a space-filling curve composed of different sub-keys, containing data cells such as schema information. The key domain is stored in a separate physical file, which is equivalent to a data index file;
Value domain: stores the data content for the key domain. A "key-value pair" contains a linked list of multiple consecutive values, and each data copy carries a timestamp that identifies its data version. The value domain is stored in a separate physical file, which is equivalent to a data content file;
The present invention uses a (key-value) pair to represent one basic attribute value. An entity can contain multiple attributes, and a data object is mapped to a set of (key-value) pairs. The key domain consists of the following parts:
RK: the unique identifier of the data object (tuple). RK makes (key-value) pairs carrying the same identifier point to the same object;
LT: used to lock a (key-value) pair to support transaction semantics. Each application process is responsible for writing its data; a semaphore marks the tuple for exclusive access, and the lock is released when the transaction completes;
CG: the attribute column group, containing one or more attribute columns;
CA: the addresses of the columns in the CG. It holds the address of one or more attribute columns and can be regarded as a data cell pointer to the data cell in the value domain file;
Note: an address points to the data cell of a specific file, record and column. The mask has the form "fn_idxn_coln"; the mask units are separated by "_" and are defined as follows:
fn identifies the data file, e.g. f1 corresponds to DataFile1, f2 corresponds to DataFile2, and so on;
idxn is the row offset within a file (one row can correspond to several records, distinguished by their timestamps; idxn hits the record with the largest timestamp, that is, the record sorted at the top), e.g. f1_idx1 is the record at row offset 1 in DataFile1;
coln is the n-th column of a record, e.g. f1_idx1_col3 is the 3rd column of the record in row 1 of DataFile1; for the column data format, see the PosType definition.
Tn: the timestamp of version n of the data;
Vn: the data content of version n.
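A small parser for the "fn_idxn_coln" mask is sketched below. Mapping fn to the file name DataFilen follows the examples in the text; the regular expression and the tuple type are illustrative choices.

```python
import re
from typing import NamedTuple


class CellAddress(NamedTuple):
    data_file: str   # value domain file, e.g. "DataFile1"
    row_offset: int  # n from "idxn"
    column: int      # n from "coln"


_MASK = re.compile(r"^f(\d+)_idx(\d+)_col(\d+)$")


def parse_ca(mask: str) -> CellAddress:
    """Split a CA mask such as 'f1_idx1_col3' into its three units."""
    match = _MASK.match(mask)
    if match is None:
        raise ValueError(f"not a valid CA mask: {mask!r}")
    file_no, row_offset, column = (int(g) for g in match.groups())
    return CellAddress(f"DataFile{file_no}", row_offset, column)


address = parse_ca("f1_idx1_col3")
```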
In Fig. 3, the key domain model on the left is stored in the key domain file, and the value domain model on the right is stored in the value domain file. For a table field defined in the CG field of the key domain model, the CA field points to that field's data cell in the value domain file.
The combination of the RK, CG and CA keys forms the query primary key, which uniquely identifies one data cell. Given the high concurrency of data processing, read/write conflicts on a data cell must be avoided effectively, so the present invention uses the LT key to assist transaction consistency control. Each data cell also keeps multiple versions of a data object, indexed by "timestamp". The timestamp is a 64-bit integer with millisecond precision; the different versions are held in a flat structure, arranged in descending timestamp order.
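The multi-version data cell can be sketched as a flat list kept in descending timestamp order. This is an illustration only: it leaves out the LT locking, assumes a signed 64-bit range for the timestamp, and the class name is hypothetical.

```python
import bisect
import time
from typing import List, Optional, Tuple


class DataCell:
    """Versions of one data cell, newest first."""

    def __init__(self) -> None:
        # Timestamps are stored negated so that ascending order of the
        # list corresponds to descending order of the real timestamps.
        self._versions: List[Tuple[int, str]] = []

    def write(self, content: str, timestamp_ms: Optional[int] = None) -> None:
        if timestamp_ms is None:
            timestamp_ms = time.time_ns() // 1_000_000  # milliseconds
        if not 0 <= timestamp_ms < 2 ** 63:  # must fit a signed 64-bit integer
            raise ValueError("timestamp out of 64-bit range")
        bisect.insort(self._versions, (-timestamp_ms, content))

    def latest(self) -> str:
        """Content of the version with the largest timestamp."""
        return self._versions[0][1]

    def versions(self) -> List[Tuple[int, str]]:
        return [(-neg_ts, content) for neg_ts, content in self._versions]


cell = DataCell()
cell.write("v1", 1000)
cell.write("v3", 3000)
cell.write("v2", 2000)
```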
Selectable physical storage format
For the physical storage structure of an entity's (key-value) pairs, the present invention adopts horizontal sharding to accommodate data updates and divides each shard vertically into multiple partitions. For a given partition, row storage or column storage can be selected:
Row storage: suitable for dense data. The whole partition is mapped to a single file, forming one column group, which can contain one or more dense data columns;
Column storage: suitable for sparse data. Depending on the data characteristics of each attribute column, one column or several columns of the partition can be stored in one physical file; a one-to-one mapping between attribute columns and physical files is recommended.
When a data object has multiple versions of an attribute value, the present invention keeps appending (key-value) pairs to the file, with the data content arranged in descending timestamp order. To avoid the management burden (storage, indexing and so on) caused by multiple data versions, the invention provides two version reclamation mechanisms: one keeps the last n versions of the data, and the other keeps the versions from a recent period (for example the last 3 days). Either can be configured specifically for the data characteristics of each column group or column. Fig. 4 is a schematic diagram of the "multi-version" data storage structure of a data object provided by an embodiment of the present invention.
The application of the layered, configurable storage structure is described in detail below.
Three examples are described in detail for the typical storage scenarios of mixed row-column storage, row storage and column storage. The application entity tables are Table_Test1, Table_Test2 and Table_Test3. The three tables have the same structure but differ in data characteristics such as sparsity and density, so a different storage method is chosen for each.
Mixed row-column storage
Entity content and use of row and column storage
As shown in Table 2 below, the Col1 and Col2 columns are fully populated and are dense data columns, so row storage can be adopted and they belong to column group ColumnGroup_R1. The Col3, Col4 and Col5 columns have many missing values and are sparse data columns, so column storage is adopted and they belong to column group ColumnGroup_C1.
Table 2
Table 3
Note: key-domain data is saved in the KeyFile1 file. For value-domain data, Col1 and Col2 are stored by row and are both saved in the DataFile1 file. Col3, Col4 and Col5 are stored by column; to improve statistical performance each column is stored separately, in the DataFile2, DataFile3 and DataFile4 files respectively.
Table 4: Key-value model definition (key domain file KeyFile1)
The values of the CA column in the table above are equivalent to a file index for the "value-domain content" of the five columns col1 to col5. The content of a CA value is as follows:
a) fn corresponds to a specific data file, e.g. f1 corresponds to DataFile1, f2 corresponds to DataFile2, and so on;
b) idxn is the row offset within a file (one row can correspond to several records, distinguished by their timestamps; idxn hits the record with the largest timestamp, that is, the record sorted at the top), e.g. f1_idx1 is the record at row offset 1 in DataFile1;
c) coln is the n-th column of a record, e.g. f1_idx1_col3 is the 3rd column of the record in row 1 of DataFile1; for the column data format, see the PosType definition.
DataFile1 file content  Explanation
Row1,timestamp1,1,LiPing  Row offset is idx1; the delimiter is ","
Row2,timestamp2,2,WangLin  Row offset is idx2; the delimiter is ","
Row3,timestamp3,3,ZhangLi  Row offset is idx3; the delimiter is ","
Table 5: File format definition for row storage (value domain file DataFile1)
DataFile2 file content  Explanation
Row1,timestamp1,NULL  Row offset is idx1; the delimiter is ","
Row2,timestamp2,NULL  Row offset is idx2; the delimiter is ","
Row3,timestamp3,2010/2/1  Row offset is idx3; the delimiter is ","
Table 6: File format definition for column storage (value domain file DataFile2)
DataFile3 file content  Explanation
Row1,timestamp1,TRUE  Row offset is idx1; the delimiter is ","
Row2,timestamp2,NULL  Row offset is idx2; the delimiter is ","
Row3,timestamp3,VACANT  Row offset is idx3; the delimiter is ","
Table 7: File format definition for column storage (value domain file DataFile3)
DataFile4 file content  Explanation
Row1,timestamp1,201  Row offset is idx1; the delimiter is ","
Row2,timestamp2,NULL  Row offset is idx2; the delimiter is ","
Row3,timestamp3,NULL  Row offset is idx3; the delimiter is ","
Table 8: File format definition for column storage (value domain file DataFile4)
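The file contents listed in Tables 5 to 8 can be reproduced with a short sketch. The row values are reconstructed from those four listings, the timestamps are kept as the placeholder strings used there, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, Dict[str, str]]  # (row key, timestamp, column values)

rows: List[Row] = [
    ("Row1", "timestamp1", {"Col1": "1", "Col2": "LiPing", "Col3": "NULL",
                            "Col4": "TRUE", "Col5": "201"}),
    ("Row2", "timestamp2", {"Col1": "2", "Col2": "WangLin", "Col3": "NULL",
                            "Col4": "NULL", "Col5": "NULL"}),
    ("Row3", "timestamp3", {"Col1": "3", "Col2": "ZhangLi", "Col3": "2010/2/1",
                            "Col4": "VACANT", "Col5": "NULL"}),
]

# Col1 and Col2 share one row-store file; each sparse column has its own file.
layout: Dict[str, List[str]] = {
    "DataFile1": ["Col1", "Col2"],
    "DataFile2": ["Col3"],
    "DataFile3": ["Col4"],
    "DataFile4": ["Col5"],
}


def render(rows: List[Row], layout: Dict[str, List[str]],
           delimiter: str = ",") -> Dict[str, List[str]]:
    """Build the delimited lines of every value domain file."""
    files: Dict[str, List[str]] = {}
    for file_name, cols in layout.items():
        files[file_name] = [
            delimiter.join([row_key, timestamp] + [values[c] for c in cols])
            for row_key, timestamp, values in rows
        ]
    return files


files = render(rows, layout)
```

Each line keeps the row key and timestamp in front of the column values, so a record in any file can be matched back to its tuple.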
Figure 5 shows that the entirety that the ranks mixing that the embodiment of the present invention provides stores realizes logical schematic.
Row stores
Physical contents and ranks memory utilization
As shown in table 9 below, the data of Col1, Col2, Col3, Col4, Col5 row are full, belong to dense data row, row all can be adopted to store, be subordinated to row group ColumnGroup_R1.
Table 9
Table 10: Tabular soft schema definition
Note: key-domain data is saved in the KeyFile1 file, and all value-domain data is saved in the DataFile1 file.
Table 11: Key-value model definition (key-domain file KeyFile1)
In Table 11, the value of the CA column serves as a file index into the "value-domain content" of the five columns col1 to col5. The parts of a CA column value are as follows:
fn identifies the specific data file; for example, f1 corresponds to DataFile1;
idxn is the row offset within that file (one row may correspond to multiple records, which are distinguished by their timestamps; idxn hits the record with the largest timestamp, that is, the record sorted to the top); for example, f1_idx1 is the record at a row offset of 1 in the DataFile1 file;
coln is the n-th column of a given record; for example, f1_idx1_col3 is the 3rd column of the 1st row record of the DataFile1 file. For the content of Column Cata Format, see the PosType definition.
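The way a CA value such as f1_idx1_col3 resolves to a stored cell can be sketched as follows. This is a minimal illustration under stated assumptions: the names parse_ca and lookup are hypothetical, files are modelled as in-memory lists of lines, idx1 is taken to be the first line of the file, and col1 is taken to be the first comma-separated field of the line (so the row key and timestamp are counted as fields). The text does not confirm these indexing conventions.

```python
import re

# fN_idxN with an optional _colN suffix, e.g. "f1_idx1" or "f1_idx1_col3".
CA_PATTERN = re.compile(r"^f(\d+)_idx(\d+)(?:_col(\d+))?$")


def parse_ca(ca):
    """Split a CA value into (file name, row offset, column number or None)."""
    m = CA_PATTERN.match(ca)
    if m is None:
        raise ValueError(f"not a valid CA value: {ca!r}")
    file_no, idx, col = m.groups()
    return f"DataFile{file_no}", int(idx), (int(col) if col else None)


def lookup(files, ca, sep=","):
    """Return the record, or the single field, that a CA value points to."""
    name, idx, col = parse_ca(ca)
    record = files[name][idx - 1]       # assumption: idx1 is the first line
    if col is None:
        return record
    return record.split(sep)[col - 1]   # assumption: col1 is the first field


files = {"DataFile1": ["Row1,timestamp1,1,LiPing",
                       "Row2,timestamp2,2,WangLin",
                       "Row3,timestamp3,3,ZhangLi"]}
print(lookup(files, "f1_idx1"))       # prints Row1,timestamp1,1,LiPing
print(lookup(files, "f1_idx1_col3"))  # prints 1
```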
Table 12: File format definition for row storage (value-domain file DataFile1)
Figure 6 is a schematic diagram of the overall implementation logic of the row storage provided by an embodiment of the present invention.
Column storage
Entity content and row/column storage policy
As shown in Table 13 below, the columns Col1, Col2, Col3, Col4, and Col5 have many missing values and are all sparse data columns; they use column storage and belong to the column group ColumnGroup_C1.
Table 13
Table 14: Tabular soft schema definition
Note: key-domain data is saved in the KeyFile1 file, and value-domain data is saved in the DataFile1 file.
Table 15: Key-value model definition (key-domain file KeyFile1)
In the table above, the value of the CA column serves as a file index into the "value-domain content" of the five columns col1 to col5. The parts of a CA column value are as follows:
fn identifies the specific data file; for example, f1 corresponds to DataFile1;
idxn is the row offset within that file (one row may correspond to multiple records, which are distinguished by their timestamps; idxn hits the record with the largest timestamp, that is, the record sorted to the top); for example, f1_idx1 is the record at a row offset of 1 in the DataFile1 file;
coln is the n-th column of a given record; for example, f1_idx1_col3 is the 3rd column of the 1st row record of the DataFile1 file. For the content of Column Cata Format, see the PosType definition.
DataFile1 file content | Explanation
Row1,timestamp1,1 | The row offset is idx1; the separator is ","
Row2,timestamp2,2 | The row offset is idx2; the separator is ","
Row3,timestamp3,3 | The row offset is idx3; the separator is ","
Table 16: File format definition for column storage (value-domain file DataFile1)
DataFile2 file content | Explanation
Row1,timestamp1,LiPing | The row offset is idx1; the separator is ","
Row2,timestamp2,VACANT | The row offset is idx2; the separator is ","
Row3,timestamp3,ZhangLi | The row offset is idx3; the separator is ","
Table 17: File format definition for column storage (value-domain file DataFile2)
DataFile3 file content | Explanation
Row1,timestamp1,NULL | The row offset is idx1; the separator is ","
Row2,timestamp2,NULL | The row offset is idx2; the separator is ","
Row3,timestamp3,2010/2/1 | The row offset is idx3; the separator is ","
Table 18: File format definition for column storage (value-domain file DataFile3)
DataFile4 file content | Explanation
Row1,timestamp1,TRUE | The row offset is idx1; the separator is ","
Row2,timestamp2,VACANT | The row offset is idx2; the separator is ","
Row3,timestamp3,VACANT | The row offset is idx3; the separator is ","
Table 19: File format definition for column storage (value-domain file DataFile4)
DataFile5 file content | Explanation
Row1,timestamp1,201 | The row offset is idx1; the separator is ","
Row2,timestamp2,NULL | The row offset is idx2; the separator is ","
Row3,timestamp3,NULL | The row offset is idx3; the separator is ","
Table 20: File format definition for column storage (value-domain file DataFile5)
Figure 7 is a schematic diagram of the overall implementation logic of the column storage provided by an embodiment of the present invention.
Handling of missing values
Data from emerging services such as the internet and the Internet of Things contains a large number of missing values. To address this, the present invention defines a classification of missing values and introduces a "four-valued logic" to standardize operations on them.
Type definition of missing values
In big data processing, besides the efficient storage of "mixed data sets", the handling of missing values in sparse data is also a key issue. A missing value is an attribute value that is absent from a data object (tuple). An attribute value can be absent for several reasons, and each type of missing value reflects a different conjecture. The present invention distinguishes an applicable case and a not-applicable case, that is, "missing and applicable" and "missing and not applicable"; together the two definitions cover the forms that a "null value" can take:
Missing and not applicable (Unknown): the value exists but cannot be known. For example, a transmission anomaly or a failed gateway data backfill leaves the "network element field" empty (under normal conditions it is not allowed to be empty); in this case the value is unavailable;
Missing and applicable (Nonexistent): the value truly does not exist. For example, a field is null and is allowed to be null; in this case the value is usable;
The present invention uses the NULL symbol to represent missing values of the Unknown type and the VACANT symbol to represent missing values of the Nonexistent type. The two symbols must satisfy the following constraints:
NULL and VACANT are two representation symbols, not two values; that is, NULL → Unknown and VACANT → Nonexistent;
A NULL of the Unknown type can be changed to another value by an update operation, whereas a VACANT of the Nonexistent type is uniquely determined and should not be modified.
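The two constraints above can be sketched as follows. This is only an illustration: the Missing class, the update_cell function, and the choice to raise an error on an attempt to overwrite VACANT are assumptions made here, since the text states only that VACANT should not be modified.

```python
class Missing:
    """A representation symbol for a missing value; it is not itself a value."""

    def __init__(self, symbol, kind):
        self.symbol = symbol
        self.kind = kind

    def __repr__(self):
        return self.symbol


NULL = Missing("NULL", "Unknown")            # value exists but cannot be known
VACANT = Missing("VACANT", "Nonexistent")    # value truly does not exist


def update_cell(record, column, new_value):
    """Update one cell, refusing to overwrite a VACANT marker."""
    if record[column] is VACANT:
        # Assumption: the rule is enforced by raising an error.
        raise ValueError(f"{column} is VACANT and should not be modified")
    record[column] = new_value


record = {"Col3": NULL, "Col4": VACANT}
update_cell(record, "Col3", "2010/2/1")      # allowed: NULL may be updated
print(record)  # prints {'Col3': '2010/2/1', 'Col4': VACANT}
```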
Four-valued logic
The present invention defines missing values as two types, Unknown and Nonexistent. Alongside this "symbolic representation", the "truth value of a conditional expression" in SQL statements must be defined as well. The present invention uses the symbol Maybe to denote the result of comparing data with NULL (the underlying result is either True or False; whether NULL is compared with other data or with another NULL, Maybe is always produced) and the symbol Neglect to denote the result of comparing data with VACANT. Any conditional statement therefore has four possible truth values: True, False, Maybe, and Neglect; that is, comparisons involving these two kinds of missing values give rise to a four-valued logic (4VL). A logical expression has three "basic" values: t (true), f (false), and i (inapplicable or undefined). Each truth value is a non-empty subset of the set {t, f, i}:
1. True = {t}; 2. False = {f}; 3. Maybe = {t, f}; 4. Neglect = {i}
Table 21: Truth table definition for the four-valued logic
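The comparison rule described above can be sketched as follows. This is a minimal sketch, not the embodiment's implementation: the names Missing and compare_eq are hypothetical, and the rule that VACANT takes precedence when NULL is compared with VACANT is an assumption, because the text does not state that case. The contents of Table 21 are not reproduced in this text, so the sketch covers comparisons only and does not attempt AND, OR, or NOT.

```python
from enum import Enum


class Missing(Enum):
    NULL = "Unknown"          # value exists but cannot be known
    VACANT = "Nonexistent"    # value truly does not exist


# Each truth value is a non-empty subset of the basic values {t, f, i}.
TRUE = frozenset({"t"})
FALSE = frozenset({"f"})
MAYBE = frozenset({"t", "f"})
NEGLECT = frozenset({"i"})


def compare_eq(a, b):
    """Four-valued equality comparison of two cell values."""
    if a is Missing.VACANT or b is Missing.VACANT:
        return NEGLECT   # assumption: VACANT wins over NULL when both occur
    if a is Missing.NULL or b is Missing.NULL:
        return MAYBE     # the real answer is t or f, but it is unknown
    return TRUE if a == b else FALSE


print(compare_eq(201, 201) == TRUE)                       # prints True
print(compare_eq(Missing.NULL, Missing.NULL) == MAYBE)    # prints True
print(compare_eq("LiPing", Missing.VACANT) == NEGLECT)    # prints True
```

Representing each truth value as a set makes the meaning of Maybe explicit: it is the set of both outcomes that remain possible while the value is unknown.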
The following describes a dynamic data storage device provided by an embodiment of the present invention, which performs the dynamic data storage method provided in the foregoing embodiments.
Figure 8 is a schematic structural diagram of the dynamic data storage device provided by an embodiment of the present invention. The components of the device are described in detail below with reference to this figure:
A definition unit 801, configured to perform global metadata definition on the data to be stored and to define a storage policy for each attribute column in the data to be stored;
The definition unit defining the storage policy of each attribute column in the data to be stored comprises:
building a tabular soft schema, adding a schema definition for each attribute column, and, from the perspective of a view, defining the set of key-value pairs as an elastic container for data storage.
The definition unit defining the storage policy of each attribute column in the data to be stored further comprises:
if the data in the attribute column is dense data, using row storage;
if the data in the attribute column is sparse data, using column storage.
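The policy choice above can be sketched as follows. This is an illustration under stated assumptions: the text does not give a numeric boundary between dense and sparse data, so the function name choose_storage and the dense_threshold parameter (with its default of 0.9) are hypothetical, as is the use of the strings "NULL" and "VACANT" to mark missing cells.

```python
def choose_storage(values, missing_symbols=("NULL", "VACANT"),
                   dense_threshold=0.9):
    """Return "row" for a dense attribute column and "column" for a sparse one.

    values: the cells of one attribute column.
    dense_threshold: hypothetical minimum fraction of populated cells
    for a column to count as dense.
    """
    if not values:
        return "column"  # assumption: an empty column is treated as sparse
    populated = sum(1 for v in values if v not in missing_symbols)
    fill_ratio = populated / len(values)
    return "row" if fill_ratio >= dense_threshold else "column"


print(choose_storage([1, 2, 3]))                   # prints row
print(choose_storage(["NULL", "NULL", "2010/2/1"]))  # prints column
```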
An organization unit 802, configured to organize the attribute columns into different data subsets according to a key-value pair model;
The organization unit organizing the attribute columns into different data subsets according to the key-value pair model comprises:
defining the key domain and the value domain of the key-value pair model separately, and storing the key domain and the value domain separately;
the key domain comprising at least one sub-key, with the schema definition of the attribute column stored in the key domain;
the value domain storing the data content corresponding to the key domain.
A storage unit 803, configured to define a physical storage format for the data subsets according to the storage policies of the attribute columns and to store the data subsets in that physical storage format.
The storage unit defining a physical storage format for the data subsets according to the storage policies of the attribute columns comprises:
dividing the attribute columns into at least one partition and, for each partition:
if the partition is suited to column storage, mapping the partition to an independent file to form a column group;
if the partition is suited to row storage, storing at least one column of the partition in one file.
In the foregoing dynamic data storage device, the definition unit is further configured to define the missing data in the attribute columns as two types: missing and not applicable, and missing and applicable.
With the dynamic data storage device disclosed in the embodiments of the present invention, a dynamic data storage method is implemented through a layered, configurable storage structure. It meets the storage needs of both sparse and dense data sets in big data processing, solves the "single storage scheme" problem of the current business analysis system, provides a flexible storage scheme for mass data such as Internet of Things and internet data, and effectively supports data storage and analysis under "new business models";
The RCFILE storage method (horizontal partitioning first, then vertical partitioning) has a complex internal structure and a high cost of schema change, and is better suited to "read-only" data warehouses. The present invention, by contrast, defines the storage structure in layers and separates keys from values; the structure is simple and loosely coupled, and schema changes can be made quickly (for example, "adding or deleting a column" only requires adding or deleting a file in the "tabular soft schema layer" configuration);
RCFILE processes "all attribute columns of an entity" according to one principle (horizontal partitioning first, then vertical partitioning), which is not flexible enough. The present invention can instead select row storage or column storage according to the "data characteristics" of each entity attribute column: for example, column storage is selected for an entity's sparse data columns to save storage space, and row storage is selected for dense data columns to achieve efficient writes;
To address the massive amount of missing data that is common in services such as the Internet of Things and the internet, the present invention defines the types of missing values, providing an effective technical means of improving the "analysis quality" of big data;
The technical implementation of this invention sits in the data layer of the "information service domain" of the business analysis system and exchanges data with upper-layer business applications through a "transparent access layer". Business coupling is low and the processing logic is simple and clear, so the mass data storage and access requirements of the business analysis data warehouse are fully met.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk, an optical disc, a network node, or a scheduler.
Finally, it should be noted that the above are merely preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, a person skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (15)

1. A dynamic data storage method, characterized by comprising:
performing global metadata definition on data to be stored, and defining a storage policy for each attribute column in the data to be stored;
organizing the attribute columns into different data subsets according to a key-value pair model;
defining a physical storage format for the data subsets according to the storage policies of the attribute columns, and storing the data subsets in the physical storage format.
2. The method according to claim 1, characterized in that defining the storage policy of each attribute column in the data to be stored comprises:
building a tabular soft schema, adding a schema definition for each attribute column, and, from the perspective of a view, defining the set of key-value pairs as an elastic container for data storage.
3. The method according to claim 2, characterized in that organizing the attribute columns into different data subsets according to the key-value pair model comprises:
defining the key domain and the value domain of the key-value pair model separately, and storing the key domain and the value domain separately;
the key domain comprising at least one sub-key, with the schema definition of the attribute column stored in the key domain;
the value domain storing the data content corresponding to the key domain.
4. The method according to claim 3, characterized in that the sub-keys in the key domain comprise RK, CG, and CA, and the combination of the RK, CG, and CA keys forms a query primary key that corresponds uniquely to one data cell.
5. The method according to any one of claims 1 to 4, characterized in that defining the storage policy of each attribute column in the data to be stored comprises:
if the data in the attribute column is dense data, using row storage;
if the data in the attribute column is sparse data, using column storage.
6. The method according to claim 5, characterized in that defining a physical storage format for the data subsets according to the storage policies of the attribute columns comprises:
dividing the attribute columns into at least one partition and, for each partition:
if the partition is suited to column storage, mapping the partition to an independent file to form a column group;
if the partition is suited to row storage, storing at least one column of the partition in one file.
7. The method according to claim 3, characterized in that the data content stored in the value domain further comprises a timestamp, and the data content is arranged in timestamp order.
8. The method according to claim 7, characterized by further comprising:
presetting the number of versions of the data content to be kept, and keeping the preset number of versions according to the timestamps of the data content; or
presetting the time period for which the data content is to be kept, and keeping the versions within the preset time period according to the timestamps.
9. The method according to any one of claims 1 to 8, characterized in that the missing data in the attribute columns is defined as two types: missing and not applicable, and missing and applicable.
10. A dynamic data storage device, characterized by comprising:
a definition unit, configured to perform global metadata definition on data to be stored and to define a storage policy for each attribute column in the data to be stored;
an organization unit, configured to organize the attribute columns into different data subsets according to a key-value pair model;
a storage unit, configured to define a physical storage format for the data subsets according to the storage policies of the attribute columns and to store the data subsets in the physical storage format.
11. The device according to claim 10, characterized in that the definition unit defining the storage policy of each attribute column in the data to be stored comprises:
building a tabular soft schema, adding a schema definition for each attribute column, and, from the perspective of a view, defining the set of key-value pairs as an elastic container for data storage.
12. The device according to claim 11, characterized in that the organization unit organizing the attribute columns into different data subsets according to the key-value pair model comprises:
defining the key domain and the value domain of the key-value pair model separately, and storing the key domain and the value domain separately;
the key domain comprising at least one sub-key, with the schema definition of the attribute column stored in the key domain;
the value domain storing the data content corresponding to the key domain.
13. The device according to any one of claims 10 to 12, characterized in that the definition unit defining the storage policy of each attribute column in the data to be stored comprises:
if the data in the attribute column is dense data, using row storage;
if the data in the attribute column is sparse data, using column storage.
14. The device according to claim 13, characterized in that the storage unit defining a physical storage format for the data subsets according to the storage policies of the attribute columns comprises:
dividing the attribute columns into at least one partition and, for each partition:
if the partition is suited to column storage, mapping the partition to an independent file to form a column group;
if the partition is suited to row storage, storing at least one column of the partition in one file.
15. The device according to any one of claims 10 to 14, characterized in that the definition unit is further configured to define the missing data in the attribute columns as two types: missing and not applicable, and missing and applicable.
CN201310459768.0A 2013-09-29 2013-09-29 A kind of dynamic date storage method and device Active CN104516912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310459768.0A CN104516912B (en) 2013-09-29 2013-09-29 Dynamic data storage method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310459768.0A CN104516912B (en) 2013-09-29 2013-09-29 Dynamic data storage method and device

Publications (2)

Publication Number Publication Date
CN104516912A true CN104516912A (en) 2015-04-15
CN104516912B CN104516912B (en) 2018-06-26

Family

ID=52792222

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310459768.0A Active CN104516912B (en) 2013-09-29 2013-09-29 Dynamic data storage method and device

Country Status (1)

Country Link
CN (1) CN104516912B (en)

Also Published As

Publication number Publication date
CN104516912B (en) 2018-06-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant