CN102945241A - Hash data structure used for file comparison, hash comparison system and method - Google Patents

Info

Publication number
CN102945241A
CN102945241A CN2012103330235A CN201210333023A CN102945241A CN 102945241 A CN102945241 A CN 102945241A CN 2012103330235 A CN2012103330235 A CN 2012103330235A CN 201210333023 A CN201210333023 A CN 201210333023A CN 102945241 A CN102945241 A CN 102945241A
Authority
CN
China
Prior art keywords
hash
data
source file
fileinfo
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012103330235A
Other languages
Chinese (zh)
Inventor
张星国
刘光喜
成周弦
陈譓瑱
李允珩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEOWIZ CORP
Original Assignee
NEOWIZ CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEOWIZ CORP filed Critical NEOWIZ CORP
Publication of CN102945241A publication Critical patent/CN102945241A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a hash data structure used for file comparison, and to a hash comparison system and method. The hash data comparison system according to the embodiments compares source files using hash data that comprises file information and a hash value. The system comprises a file information generation unit, a hash generation unit, and a control unit. The file information generation unit examines the attributes of each source file and generates file information related to that source file. The hash generation unit calculates a hash value by applying a hash function algorithm to at least a part of the source file. The control unit generates, for each source file, hash data comprising the file information and the hash value. The invention is advantageous in that hash values need not be compared for files that already differ in their file information, so files can be compared more quickly.

Description

Hash data structure for file comparison, and hash comparison system and method
Technical field
The present invention relates generally to hashing techniques for data files and, more specifically, to a hash data structure and to a hash comparison system and method that use this structure. The structure combines a hash value with characteristic information unique to a source file, so that files can be compared more quickly.
Background
Comparison between multiple pieces of data, and between data files in particular, is used in many operations. For example, file comparison is used to check for changes between files in an operating system (OS), or to compare a patch file with a source file in order to apply a given patch.
Conventional file comparison techniques include comparing the files in their entirety, assigning version information to files and checking the files based on that version information, and applying a hash function to the files and comparing the results.
Comparing files in their entirety is not often used, because there is a large amount of data to compare and the comparison is slow. Assigning version information to files has the drawback that, even when the content of a file has changed, the file still matches on version information unless the version information is also updated, so the comparison does not give a correct result.
Therefore, in most cases, a hash value is calculated by applying a hash function to each file, and the contents of the files are compared by comparing the calculated hash values. However, this conventional method, which uses only hash values, has the problem that the larger the file, the more computational resources are needed to generate the hash value and the longer the corresponding operation takes.
Summary of the invention
Accordingly, the present invention has been made to solve the above problems occurring in the prior art, and an object of the invention is to provide a hash data structure that allows files to be compared easily using a smaller amount of resources.
Another object of the present invention is to provide a method of generating this hash data structure and a method of comparing hash data structures, with which files can be compared more quickly.
A further object of the present invention is to provide a hash comparison system that can compare files efficiently using the hash data structure.
According to an aspect of the present invention for achieving the above objects, there is provided a hash data structure comprising: file information that consists of predetermined data bits and relates to the attributes of a source file; and a hash value that consists of specific data bits and relates to the source file, wherein the data bits corresponding to the hash value follow the data bits corresponding to the file information.
In an embodiment, the file information may include at least one of a size value of the source file, first partial data comprising the data at the beginning of the source file, and second partial data comprising the data at the end of the source file.
In an embodiment, the hash data structure may further include a structure header, which contains structural information about each of the hash value and the file information included in the hash data structure.
In an embodiment, the hash data structure may further include parity information related to the hash data structure, wherein the parity information includes a first parity bit for the file information and a second parity bit for the hash value.
According to another aspect of the present invention for achieving the above objects, there is provided a hash data generation method for generating the hash data to be used for comparing source files, the method comprising the steps of: (a) examining the attributes of each source file and generating, based on the examined attributes, file information consisting of predetermined data bits; (b) calculating a hash value by applying a hash algorithm to at least a part of the source file; and (c) generating hash data by appending the hash value to the file information.
In an embodiment, step (a) may include: examining at least one of the size, name and format of the source file, first partial data comprising the data at the beginning of the source file, and second partial data comprising the data at the end of the source file; and generating the file information so that it includes at least one of the size, name and format of the source file, the first partial data, and the second partial data.
In an embodiment, the hash data generation method may further include step (d): generating hash parity bits for the hash data.
In an embodiment, step (d) may include: generating a first parity bit for the file information; generating a second parity bit for the hash value; and generating the hash parity bits by concatenating the first parity bit and the second parity bit.
According to another aspect of the present invention for achieving the above objects, there is provided a hash data generation method for generating the hash data to be used for comparing source files, the method comprising the steps of: (a) generating a structure header containing structural information about each of the hash value and the file information included in the hash data structure; (b) examining the attributes of each source file and generating, based on the examined attributes, file information consisting of predetermined data bits; (c) generating a hash value by applying a hash algorithm to at least a part of the source file; and (d) generating hash data by appending the hash value to the file information.
According to another aspect of the present invention for achieving the above objects, there is provided a hash data comparison method for comparing two source files using hash data that comprises file information and a hash value, the method comprising the steps of: (a) examining the two pieces of hash data associated with the two source files respectively; (b) comparing the two pieces of file information contained in the two pieces of hash data; and (c) if the two pieces of file information are identical, comparing the two hash values contained in the two pieces of hash data and, if the two hash values are identical, determining that the two source files are the same file.
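The comparison steps (a) to (c) above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `HashData` tuple, the helper `make_hash_data`, the use of an 8-byte size value as the only file information, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import struct
from typing import NamedTuple


class HashData(NamedTuple):
    file_info: bytes   # attributes of the source file (assumed: size only)
    hash_value: bytes  # digest of the source file


def make_hash_data(content: bytes) -> HashData:
    # Assumed layout: an 8-byte big-endian size value as the file information.
    file_info = struct.pack(">Q", len(content))
    return HashData(file_info, hashlib.sha256(content).digest())


def same_file(a: HashData, b: HashData) -> bool:
    # Step (b): compare the file information first.
    if a.file_info != b.file_info:
        return False  # the hash values are never examined
    # Step (c): only when the file information matches, compare the hash values.
    return a.hash_value == b.hash_value
```

With these assumptions, two files of different sizes are rejected at step (b), and files of equal size with different content are rejected at step (c).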
In an embodiment, the file information may include at least one of the size, name and format of the corresponding source file, first partial data comprising the data at the beginning of the source file, and second partial data comprising the data at the end of the source file.
In an embodiment, step (b) may include comparing each of the data bits that make up the two pieces of file information.
In an embodiment, step (b) may include: for each of the two pieces of file information, identifying at least one of the size, name and format of the source file contained in that file information, the first partial data comprising the data at the beginning of the source file, and the second partial data comprising the data at the end of the source file; and comparing the two pieces of file information with respect to whichever of these elements have been identified.
According to another aspect of the present invention for achieving the above objects, there is provided a hash data comparison method for comparing two source files using hash data that comprises file information, a hash value, and a structure header containing structural information about each of the file information and the hash value, the method comprising the steps of: (a) comparing the structure headers of the two source files and determining whether the hash data have the same structure; (b) if the hash data are determined to have the same structure, comparing the two pieces of file information associated with the two source files respectively; and (c) if the two pieces of file information are identical, comparing the hash values associated with the two source files respectively and, if the hash values are identical, determining that the two source files are the same file.
According to another aspect of the present invention for achieving the above objects, there is provided a hash data comparison system for comparing source files using hash data that comprises file information and a hash value, the system comprising: a file information generation unit configured to examine the attributes of each source file and generate file information related to that source file; a hash generation unit configured to calculate a hash value by applying a hash function algorithm to at least a part of the source file; and a control unit configured to generate, for each source file, hash data comprising the file information and the hash value.
In an embodiment, the hash data comparison system may further include a hash file management unit configured to store the generated hash data and to maintain information about the source file associated with each stored hash value.
In an embodiment, the control unit may determine whether a first source file and a second source file are identical by comparing, in sequence, the file information and the hash values of the first source file and the second source file.
In an embodiment, the control unit may generate a structure header containing identification information about the file information and the hash value, and generate hash data comprising the structure header, the file information and the hash value.
In an embodiment, the control unit may compare, in sequence, the structure headers, the file information and the hash values of the first source file and the second source file and, if the structure headers, the file information and the hash values are all identical, determine that the first source file and the second source file are the same file.
In an embodiment, the control unit may generate parity bits for the hash data, the parity bits comprising parity bits calculated separately for the file information and for the hash value.
Description of the drawings
The above and other objects and features of the present invention will be more readily understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a reference diagram showing an embodiment of the hash data structure according to the present invention;
Fig. 2 is a reference diagram showing another embodiment of the hash data structure according to the present invention;
Fig. 3 is a reference diagram showing a further embodiment of the hash data structure according to the present invention;
Fig. 4 is a block diagram showing an embodiment of the hash comparison system according to the present invention;
Fig. 5 is a flowchart showing an embodiment of the hash data generation method that can be performed by the hash comparison system of Fig. 4;
Fig. 6 is a flowchart showing another embodiment of the hash data generation method that can be performed by the hash comparison system of Fig. 4;
Fig. 7 is a flowchart showing an embodiment of the hash data comparison method that can be performed by the hash comparison system of Fig. 4;
Fig. 8 is a flowchart showing another embodiment of the hash data comparison method that can be performed by the hash comparison system of Fig. 4; and
Fig. 9 is a block diagram showing another embodiment of the hash comparison system according to the present invention.
Embodiments
The embodiments disclosed here are given only as structural or functional descriptions, so the scope of the disclosed technology should not be understood as limited by the embodiments described in this specification. That is, the embodiments may be modified and may take various forms, and the scope of the disclosed technology should be understood to include equivalents capable of realizing the technical spirit of the invention.
The terms used in this specification should be understood as follows.
Words such as "first" and "second" are used only to distinguish one component from another, and the scope of the invention should not be limited by these terms. For example, a first component may be designated as a second component and, similarly, a second component may be designated as a first component.
Throughout the specification, a statement that a first component is "connected" to a second component covers both the case where the first component is connected to the second component through some other component placed between them and the case where the first component is "directly" connected to the second component. By contrast, a statement that a first component is "directly connected" to a second component means that no component is interposed between them. Other expressions describing relationships between components (for example, "between" and "directly between", or "adjacent to" and "directly adjacent to") are to be understood in the same way.
Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended only to indicate that a feature, number, step, operation, component, part or combination thereof is present, and do not exclude the presence or addition of one or more other features, numbers, steps, operations, components, parts or combinations thereof.
The reference symbols of the steps (a, b, c and so on) are used for convenience of description and do not indicate the order of the steps. The steps may occur in an order different from that described in the specification unless a specific order is explicitly stated. That is, the steps may occur in the order described, substantially simultaneously, or in reverse order.
Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.
In the following description, the term "source file" denotes a file to which the hash data structure is to be applied. Like a typical hash value, the hash data structure provided by the invention has a distinct value for each source file.
Fig. 1 is a reference diagram showing an embodiment of the hash data structure according to the present invention.
Referring to Fig. 1, the hash data structure 100 comprises file information 110 and a hash value 120. More specifically, the hash data structure 100 may be constructed so that the data bits corresponding to the hash value follow the data bits of the file information 110 about the source file.
The file information 110 may include a source file size value 111, partial data 112 comprising the data at the beginning of the source file (hereinafter "first partial data"), and partial data 113 comprising the data at the end of the source file (hereinafter "second partial data"). Depending on the embodiment, the file information 110 may consist of at least one of these three types of data 111 to 113.
In one or more systems according to the embodiments described later, the file information 110 may be configured to have different lengths. That is, the file information 110 need not consist of a fixed number of data bits, and may instead consist of data bits of a predetermined size chosen according to the system settings or the needs of the environment.
The source file size value 111 is data indicating the size of the source file.
The first partial data 112 is the portion of the source file of a predetermined length starting from the first bit of the source file, and the second partial data 113 is the portion of the source file of a predetermined length ending at the last bit of the source file. The lengths of the first partial data 112 and the second partial data 113 may be set differently depending on the file comparison system, and the invention is not limited by these lengths.
The hash value 120 is the data obtained by applying a hash algorithm to the source file. In an embodiment, the hash value 120 may be set to a specific number of bits. That is, whereas the file information 110 is constructed so that the elements it contains and the size of each element can vary, the hash value 120 may be restricted to a specific size (number of data bits), such as a standardized size. For example, the hash value 120 may have 160 bits for the SHA-0 or SHA-1 algorithm, 256 or 224 bits for the SHA-256/224 algorithms, and 512 or 384 bits for the SHA-512/384 algorithms. In other words, even when the hash value 120 is used in one or more systems according to the embodiments, it may consist of data bits of a specific length. Because the hash value is preferably determined according to a standard, it may be restricted to data bits of a specific size.
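The fixed digest lengths mentioned above can be checked with Python's standard `hashlib` module. This is only an illustrative check; the helper name `digest_bits` is an assumption for the example, and SHA-0 is omitted because `hashlib` does not provide it.

```python
import hashlib


def digest_bits(algorithm: str) -> int:
    """Return the fixed digest length, in bits, for a hashlib algorithm name."""
    return hashlib.new(algorithm).digest_size * 8


# Print the digest length of each SHA variant named in the text.
for name in ("sha1", "sha224", "sha256", "sha384", "sha512"):
    print(name, digest_bits(name))
```

Whatever the size of the input, each algorithm yields a digest of the same length, which is why the hash value field can have a fixed size while the file information field varies.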
When files are compared, the file information 110 must be compared before the hash value 120. Consider, for example, searching for file A among files A to C. Files A to C are compared using the file information associated with file A, which makes it possible to identify file A. In this case, because the matching file can be found using the file information 110 alone, the hash values need not be compared, so the desired file can be found more quickly and with fewer resources.
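A minimal sketch of the Fig. 1 layout and of the search described above is given below. The field widths (an 8-byte size value and 4-byte partial data), the zero padding for short files, the choice of SHA-256, and the function names are assumptions made for illustration; the patent leaves these lengths to the individual system.

```python
import hashlib
import struct

PART_LEN = 4                      # assumed length of each partial-data field
FILE_INFO_LEN = 8 + 2 * PART_LEN  # size value + first and second partial data


def build_hash_data(content: bytes) -> bytes:
    """Return the file information followed by the hash value (Fig. 1 order)."""
    size = struct.pack(">Q", len(content))             # size value (111)
    first = content[:PART_LEN].ljust(PART_LEN, b"\0")  # first partial data (112)
    last = content[-PART_LEN:].rjust(PART_LEN, b"\0")  # second partial data (113)
    digest = hashlib.sha256(content).digest()          # hash value (120)
    return size + first + last + digest


def candidates_by_file_info(target: bytes, stored: dict[str, bytes]) -> list[str]:
    """Names whose file information equals the target's; hash values are not read."""
    wanted = target[:FILE_INFO_LEN]
    return [name for name, data in stored.items() if data[:FILE_INFO_LEN] == wanted]
```

In this sketch, looking up file A among the stored entries for A, B and C touches only the first `FILE_INFO_LEN` bytes of each entry. The hash values would be consulted only if more than one candidate remained.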
Fig. 2 is a reference diagram showing another embodiment of the hash data structure according to the present invention. Compared with the embodiment of Fig. 1, the hash data structure shown in Fig. 2 further includes a structure header 130.
The structure header 130 contains information about the structure of the file information 110 and the hash value 120. For example, the structure header 130 may contain information about the total number of bits of the file information 110 and the total number of bits of the hash value 120.
In an embodiment, the structure header 130 may contain information about the hash function used to calculate the hash value 120, for example SHA-0 or SHA-1.
In an embodiment, the file information 110 may include only at least one of the three kinds of data 111 to 113 shown in the drawing, and the structure header 130 may provide information about which data are included in the file information 110.
For example, suppose that the file size information, the first partial data and the second partial data are denoted A, B and C respectively, that the file size information has a fixed size of two bytes, and that the structure header 130 consists of "6AB". In this case, the "6" in the structure header 130 indicates the total number of bytes of the file information 110, and "AB" indicates that the file information 110 consists of the file size information 111 and the first partial data 112. It follows that the first partial data 112 occupies the remaining four bytes.
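The "6AB" example can be decoded mechanically, as in the sketch below. The header grammar (a decimal byte count followed by the element codes A, B and C) is inferred from this single example, and the rule that B and C share the remaining bytes equally when both are present is purely an assumption. `parse_header` is a hypothetical helper.

```python
import re

SIZE_FIELD_BYTES = 2  # from the example: the size information is fixed at two bytes


def parse_header(header: str) -> dict[str, int]:
    """Map each element code (A, B, C) to its length in bytes."""
    match = re.fullmatch(r"(\d+)([ABC]+)", header)
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    total, codes = int(match.group(1)), match.group(2)
    lengths: dict[str, int] = {}
    remaining = total
    if "A" in codes:
        lengths["A"] = SIZE_FIELD_BYTES
        remaining -= SIZE_FIELD_BYTES
    parts = [code for code in codes if code in "BC"]
    # Assumption: when both B and C are present they share the rest equally.
    for code in parts:
        lengths[code] = remaining // len(parts)
    return lengths
```

For the header "6AB" this gives two bytes for the size information and four bytes for the first partial data, matching the example in the text.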
The hash data structure 100 disclosed in the embodiment of Fig. 2 can also be applied where a single system uses file information 110 of differing lengths, because the structure header 130 allows the bits of each element of the hash data structure 100 to be identified individually.
Fig. 3 is a reference diagram showing a further embodiment of the hash data structure according to the present invention. Compared with the embodiment of Fig. 1, the hash data structure shown in Fig. 3 further includes parity information 140.
The parity information 140 contains parity values for the hash data structure 100.
In an embodiment, the parity information 140 may include (i) a parity bit for the file information 110 and (ii) a parity bit for the hash value 120. The parity values are determined separately because, when comparing files, the invention may complete the comparison using the file information 110 alone.
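The separate parity bits can be sketched as follows. The patent does not specify the parity scheme, so even parity over all bits of each field, and storing each parity bit in its own byte, are assumptions made for this example.

```python
def parity_bit(data: bytes) -> int:
    """Even-parity bit: 1 if the number of set bits in data is odd, else 0."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def parity_info(file_info: bytes, hash_value: bytes) -> bytes:
    # The first parity bit covers the file information, the second the hash value.
    # Assumption: each bit is stored in a byte of its own for simplicity.
    return bytes([parity_bit(file_info), parity_bit(hash_value)])
```

Keeping the two bits separate means the file information can be checked for corruption on its own when a comparison ends before the hash value is ever read.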
When files are transferred or the like, the embodiment of Fig. 3 makes it possible to perform error checking and file comparison more effectively.
Fig. 4 is a block diagram showing an embodiment of the hash comparison system according to the present invention.
The hash comparison system 200 comprises a file information generation unit 210, a hash generation unit 220, a hash file management unit 230 and a control unit 250. In an embodiment, the hash comparison system 200 may further include a source file management unit 240.
The file information generation unit 210 examines the attributes of a source file and generates file information related to the source file. Here, the attributes of each source file may include the size, name and format of the source file and partial data bits (for example, a predetermined length starting from the first data bit or ending at the last data bit).
In an embodiment, the file information generation unit 210 may generate the first partial data and the second partial data described above by reading a preset length of the source file starting from its first data bit and a preset length ending at its last data bit. The preset length may correspond to the size of the first partial data and the second partial data in the corresponding hash data structure.
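Reading the two partial-data fields does not require loading the whole file. The sketch below seeks to the end region instead; the function name `read_partial_data`, the byte-level granularity, and the temporary demo file are assumptions for illustration.

```python
import os
import tempfile


def read_partial_data(path: str, length: int) -> tuple[bytes, bytes]:
    """Return (first partial data, second partial data) of the given length."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        first = f.read(length)         # bytes at the beginning of the file
        f.seek(max(size - length, 0))  # jump to the last `length` bytes
        last = f.read(length)          # bytes at the end of the file
    return first, last


# Usage on a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"HEAD-middle-TAIL")
first, last = read_partial_data(tmp.name, 4)
os.remove(tmp.name)
print(first, last)  # b'HEAD' b'TAIL'
```

For a file shorter than `length`, both fields simply contain the whole file in this sketch; how a real system pads or handles that case is not stated in the source.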
The hash generation unit 220 generates a hash value by applying a hash function to each source file. The hash generation unit 220 may use a hash function specific to the individual system or a standard hash function, for example one based on the Secure Hash Algorithm (SHA).
In an embodiment, the hash generation unit 220 holds a plurality of hash functions and may generate the hash value for a source file using a specific hash function in response to a request from the control unit 250.
In an embodiment, the hash generation unit 220 may generate a hash value for only a part of the source file. For example, when the size of the source file is equal to or greater than a predetermined value, the hash generation unit 220 may generate a hash value for a part of the source file corresponding to a preset size. In another embodiment, the hash generation unit 220 may generate a hash value for only the remainder of the source file, excluding the first partial data and the second partial data.
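The two partial-hashing variants above can be sketched as follows. The threshold, the prefix length, the partial-data length and the decision to hash the leading bytes of a large file are all assumed values chosen for illustration; the source only says that a part "corresponding to a preset size" is hashed.

```python
import hashlib

SIZE_THRESHOLD = 1024  # assumed "predetermined value", in bytes
PREFIX_LEN = 256       # assumed "preset size" of the hashed part
PART_LEN = 4           # assumed length of the first and second partial data


def hash_prefix_if_large(content: bytes) -> bytes:
    """Hash only the first PREFIX_LEN bytes when the file reaches the threshold."""
    part = content[:PREFIX_LEN] if len(content) >= SIZE_THRESHOLD else content
    return hashlib.sha256(part).digest()


def hash_remainder(content: bytes) -> bytes:
    """Hash the bytes not already held as first and second partial data."""
    end = max(len(content) - PART_LEN, PART_LEN)
    return hashlib.sha256(content[PART_LEN:end]).digest()
```

One consequence worth keeping in mind: with the first variant, a change outside the hashed part of a large file does not affect the hash value, so it can only be noticed if it also alters the file information.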
The hash file management unit 230 manages the hash files (structures) corresponding to the source files. For example, the hash file management unit 230 stores the hash files and maintains information about the source file that matches each hash file (for example, link information).
The source file management unit 240 may store the source files and maintain a history for each source file. For example, if a hash comparison on file A determines that file A has changed, the corresponding file A and its hash history may be stored in the source file management unit 240.
The control unit 250 controls the overall operation of the hash comparison system 200 so as to generate hash data structures or to compare source files with each other.
In an embodiment, the control unit 250 may generate a hash data structure (file) for each source file. More specifically, the control unit 250 may provide a given source file to both the file information generation unit 210 and the hash generation unit 220, and generate the hash data structure using the hash value and the file information received in response. Embodiments relating to the generation of the hash data structure are described in further detail with reference to Fig. 5 and Fig. 6.
In an embodiment, the control unit 250 may compare two source files using their hash data structures. The hash data structure according to the invention is divided into file information and a hash value, and this structural feature is used to compare source files. More specifically, the control unit 250 analyzes the hash data structures of the source files to be compared and uses the file information in those structures to determine whether the source files are the same file. If the source files are determined to be the same file, the control unit 250 uses the hash values in the hash data structures to check whether the source files have the same content. The invention first determines from the file information whether the files are the same file, and compares the hash values only if they are, so the comparison is faster.
In an embodiment, when comparing file information, the control unit 250 may compare each of the data bits that make up the file information. In another embodiment, the control unit 250 may identify each element that makes up the file information and compare the file information by comparing the identified elements. That is, for each piece of file information, at least one of the size, name, format, first partial data and second partial data of the source file contained in that file information is identified, and each identified element is compared with the corresponding element of the other file information.
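The element-by-element alternative can be sketched as below. The `FileInfo` record, its field names and the helper `differing_elements` are assumptions made for the example; the bit-by-bit alternative would amount to a plain equality test on the serialized file information.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FileInfo:
    size: int
    name: str
    fmt: str
    first_part: bytes
    last_part: bytes


def differing_elements(a: FileInfo, b: FileInfo) -> list[str]:
    """Names of the identified elements whose values differ between a and b."""
    return [
        f.name
        for f in fields(FileInfo)
        if getattr(a, f.name) != getattr(b, f.name)
    ]
```

An empty result means the two pieces of file information match on every element, at which point the hash values would be compared. A non-empty result also reports which attribute differs, something a raw bit comparison cannot do.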
In an embodiment, the control unit 250 may provide each generated hash file, together with information about the source file associated with it, to the hash file management unit 230 so that the hash file can be managed. The hash file is then stored in the hash file management unit 230. When a request is received for another operation, such as a hash comparison, the hash file corresponding to a particular source file can be provided from the hash file management unit 230 to the control unit 250 so that the requested operation can be carried out.
In an embodiment, the control unit 250 may control the source file management unit 240 so as to generate a history for a source file. For example, when a patch or the like is applied to the same source file, a history of patches may be required. In this example, the control unit 250 (i) uses the file information to determine, as a result of comparing source files, whether they are the same source file, and (ii) if the hash values show that the content of the file has changed, provides information about the corresponding source file and about its hash data structure to the source file management unit 240, so that the history can be generated.
In an embodiment, control module 250 can generate a structure header for each hash data structure. More specifically, when the file information and the hash value are provided by file information generation unit 210 and hash generation unit 220, respectively, control module 250 can generate a structure header for the hash data structure so that the file information and the hash value can be identified. For example, control module 250 can generate a structure header that indicates the elements included in file information 110, the data length of each element, the length of the hash value, and so on. In this embodiment, when hash data structures are compared, control module 250 first analyzes the structure headers to identify the file information and the hash value, and then determines, based on the file information, whether the two source files being compared are the same file. If the source files are determined to be the same file, control module 250 can determine whether the content of the file has changed by comparing the hash values.
In an embodiment, control module 250 can generate parity information and add it to each hash data structure. More specifically, control module 250 can generate a parity bit for the file information and a parity bit for the hash value, and can generate parity information that contains both parity bits. This embodiment can be applied where, for example, hash data structures are transmitted between different systems. Because the parity bits are calculated separately for the file information and the hash value of a hash data structure, the parity check can be performed more quickly when hash data structures are compared.
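As a rough illustration of the parity information, the sketch below computes one parity bit per field. The description does not state which parity scheme is used, so the single even-parity bit per field and the function names `parity_bit` and `parity_info` are assumptions made for this example.

```python
from typing import Tuple


def parity_bit(data: bytes) -> int:
    """Return 1 if the number of set bits in data is odd, else 0 (even parity)."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones & 1


def parity_info(file_info: bytes, hash_value: bytes) -> Tuple[int, int]:
    """Parity information: one bit for the file information, one for the hash value."""
    return parity_bit(file_info), parity_bit(hash_value)
```

Keeping the two bits separate means a receiver can validate the file information on its own before it looks at the hash value at all.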
Fig. 5 is a flowchart showing an embodiment of a hash data generation method that can be performed by the hash comparison system of Fig. 4.
Referring to Fig. 5, at step S510, file information generation unit 210 can check the attributes of each source file under the control of control module 250. Here, the attributes are the data collected in order to generate the file information and, as described above, can be the file size, file name, file format, first part data, second part data, and so on.
At step S520, file information generation unit 210 can generate the file information based on the checked attributes of the source file. When hash data are compared, the file information is used to determine whether the two source files being compared are the same file. As described above, the file information can include at least one of the file size, the first part data, and the second part data. Alternatively, the file information can include the file name or the file format. File information generation unit 210 provides the generated file information to control module 250.
At step S530, hash generation unit 220 can generate a hash value corresponding to each source file under the control of control module 250. In an embodiment, hash generation unit 220 can support various hashing algorithms and can generate the hash value for the source file using the hashing algorithm requested by control module 250. In an embodiment, hash generation unit 220 can generate the hash value using only a part of the source file under the control of control module 250. Hash generation unit 220 provides the generated hash value to control module 250.
At step S540, control module 250 can generate the hash data from the file information and the hash value. Control module 250 can generate the hash data by appending the data bits of the hash value directly after the data bits of the file information. In this embodiment, control module 250 can know in advance at which bit, counting from the first bit, the file information ends. To that end, when control module 250 directs file information generation unit 210 and hash generation unit 220 to generate the file information and the hash value, its requests can include information about the size of the data to be generated.
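Steps S510 to S540 can be sketched as below. The layout is an assumption chosen for illustration: an 8-byte size field followed by 16 bytes from the start and 16 bytes from the end of the file, with SHA-256 as the hashing algorithm. The description fixes none of these; it requires only that the length of the file information be known in advance.

```python
import hashlib
import struct

HEAD_BYTES = 16  # assumed length of the first part data
TAIL_BYTES = 16  # assumed length of the second part data
FILE_INFO_LEN = 8 + HEAD_BYTES + TAIL_BYTES  # fixed, so the split point is known


def make_file_info(content: bytes) -> bytes:
    """S510/S520: collect attributes and pack them into a fixed-length field."""
    head = content[:HEAD_BYTES].ljust(HEAD_BYTES, b"\0")
    tail = content[-TAIL_BYTES:].rjust(TAIL_BYTES, b"\0")
    return struct.pack(">Q", len(content)) + head + tail


def make_hash_data(content: bytes) -> bytes:
    """S530/S540: hash the content and append the digest after the file information."""
    return make_file_info(content) + hashlib.sha256(content).digest()
```

Because `FILE_INFO_LEN` is constant, a reader of the hash data can separate the two parts by slicing at that offset without any extra metadata.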
Fig. 6 is a flowchart showing another embodiment of a hash data generation method that can be performed by the hash comparison system of Fig. 4. The embodiment of Fig. 6 generates the hash data using the structure header described above. Because it is obtained by adding certain steps to the embodiment of Fig. 5, the steps that are the same as or similar to those of Fig. 5 are described only briefly.
Referring to Fig. 6, at step S610, control module 250 can determine in advance the elements to be included in the file information. That is, control module 250 can determine in advance the types of the elements to be included in the file information, the size of each element, and so on, and can retain this information about the structure of the file information. Control module 250 can then request file information generation unit 210 to generate the file information, including information about the determined elements in the request.
File information generation unit 210 can generate the file information under the control of control module 250. That is, file information generation unit 210 can check the attributes of each source file at step S620, generate the file information from the checked attributes at step S630, and provide the file information to control module 250.
At step S640, hash generation unit 220 can generate the hash value for the source file under the control of control module 250 and provide the hash value to control module 250.
At step S650, control module 250 can generate the structure header for the file information and the hash value. As described above, the structure header can include information about the structure of the hash data. The structure header is used because the present invention separates the file information from the hash value within the hash data and then compares each of them individually. In an embodiment, control module 250 can generate the structure header before the file information and the hash value are generated. This is possible because the structure of the file information and the hash value (for example, the size of each element of the file information, the size of the file information, and the size of the hash value) is specified when their generation is requested, so the structure header can be generated even before the file information and the hash value have been received. In another embodiment, control module 250 can receive the file information and the hash value separately and generate the structure header afterward. That is, when file information generation unit 210 and hash generation unit 220 generate the file information and the hash value independently of each other, control module 250 can receive them separately and then generate the structure header.
Once the structure header has been generated, control module 250 can generate the hash data from the structure header, the file information, and the hash value at step S660.
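A minimal sketch of hash data that carries a structure header is shown below. The header format here (two unsigned 16-bit big-endian lengths) is an assumption for illustration; the description says only that the header indicates the elements of the file information, their lengths, and the length of the hash value.

```python
import struct

# Assumed header layout: file-information length, then hash-value length.
HEADER = struct.Struct(">HH")


def build_hash_data(file_info: bytes, hash_value: bytes) -> bytes:
    """S650/S660: prepend a header describing the two parts that follow it."""
    return HEADER.pack(len(file_info), len(hash_value)) + file_info + hash_value


def split_hash_data(blob: bytes):
    """Use the header to recover the file information and the hash value."""
    info_len, hash_len = HEADER.unpack_from(blob, 0)
    start = HEADER.size
    file_info = blob[start:start + info_len]
    hash_value = blob[start + info_len:start + info_len + hash_len]
    return file_info, hash_value
```

Since both lengths are known when generation is requested, the header bytes could be produced before either part exists, which matches the first embodiment described for step S650.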
Fig. 7 is a flowchart showing an embodiment of a hash data comparison method that can be performed by the hash comparison system of Fig. 4. The hash data comparison method shown in Fig. 7 corresponds to the hash data generation method shown in Fig. 5.
Referring to Fig. 7, at step S710, control module 250 can select the two pieces of hash data associated with the two source files to be compared. In an embodiment that includes hash file management unit 230, control module 250 can request the hash data for the two source files from hash file management unit 230 and obtain the hash data from it.
At step S720, control module 250 can check the structure of the two selected pieces of hash data. That is, control module 250 can check which part of each piece of hash data corresponds to the file information and which part corresponds to the hash value.
At step S730, control module 250 can compare the file information included in the two pieces of hash data and thereby first determine whether the two source files are the same file. For example, when the file name, file size, and so on are included in the file information, the file information can be used first to determine whether the two source files are the same file, and the file content can be checked afterward. In other words, the present invention first determines the identity of the objects, that is, whether the two files being compared are the same file, and only if they are determined to be the same object does it determine the identity of their content, that is, whether the content of the two objects is the same, which completes the comparison.
If the two pieces of file information are identical at step S740 ("Yes"), control module 250 can compare the hash values associated with the two source files at step S750.
If the hash values are also identical at step S760 ("Yes"), the two source files are determined to be identical files at step S770.
If the two pieces of file information differ at step S740 ("No"), or if the hash values differ at step S760 ("No"), the two source files can be determined to be different files at step S771.
In the steps above, when comparing file information or hash values, control module 250 can perform the comparison by checking the data bits of the objects being compared. Consequently, if the source files are determined to be different files from the file information alone, the number of data bits that must be compared is significantly reduced. The present invention can therefore compare efficiently when a 1:N comparison is required, for example when searching a plurality of files for a file identical to a particular source file.
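The flow of Fig. 7 can be sketched as a two-stage comparison. The function name `same_file` and the `file_info_len` parameter are hypothetical; the sketch assumes hash data laid out as file information followed directly by the hash value, with the file information length known in advance.

```python
def same_file(hash_data_a: bytes, hash_data_b: bytes, file_info_len: int) -> bool:
    """Two-stage comparison: file information first, hash value second."""
    # S730/S740: compare only the leading file-information bits.
    if hash_data_a[:file_info_len] != hash_data_b[:file_info_len]:
        return False  # S771: different files, hash values never examined
    # S750/S760: the files are the same object, so compare their content hashes.
    return hash_data_a[file_info_len:] == hash_data_b[file_info_len:]
```

When the file information already differs, the function returns after inspecting only `file_info_len` bytes, which is the saving the description refers to.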
Fig. 8 is a flowchart showing another embodiment of a hash data comparison method that can be performed by the hash comparison system of Fig. 4. The hash data comparison method shown in Fig. 8 corresponds to the hash data generation method shown in Fig. 6, in which the hash data further includes a structure header. Accordingly, the steps that are the same as or similar to those of the embodiment shown in Fig. 7 are described only briefly.
Referring to Fig. 8, at step S810, control module 250 can select the two pieces of hash data associated with the two source files to be compared.
At step S820, control module 250 can check the structure headers of the two selected pieces of hash data and analyze them. As described above, each structure header indicates the contents and length of the file information included in the corresponding hash data, the length of the hash value, and so on, so control module 250 can identify each element of the hash data by analyzing the structure header.
Control module 250 compares the structure headers of the two pieces of hash data with each other and, if the structure headers are identical ("Yes" at step S830), can identify the file information and the hash value included in each piece of hash data at step S840.
At step S850, control module 250 can compare the file information included in the two pieces of hash data and thereby first determine whether the two source files are the same file.
If the two pieces of file information are identical at step S860 ("Yes"), control module 250 can compare the hash values associated with the two source files at step S870.
If the hash values are identical at step S880 ("Yes"), the two source files are determined to be identical files at step S890.
If the structure headers differ at step S830 ("No"), if the file information differs at step S860 ("No"), or if the hash values differ at step S880 ("No"), the two source files can be determined to be different files at step S891.
The embodiment shown in Fig. 8 can use the structure header to identify the file information and the hash value that make up the hash data. This embodiment can be more effective in systems that use file information and hash values in differing ways. Moreover, because the structure header itself can be used at step S830 to determine whether the files are identical, the identity of the files can be determined more quickly and accurately, which makes the comparison efficient.
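The three-stage flow of Fig. 8 can be sketched as follows. The `HashData` tuple and the returned strings are hypothetical; the sketch assumes the header, file information, and hash value have already been separated, and it reports which step ended the comparison.

```python
from typing import NamedTuple


class HashData(NamedTuple):
    """Hypothetical parsed form of hash data that carries a structure header."""
    header: bytes
    file_info: bytes
    hash_value: bytes


def compare(a: HashData, b: HashData) -> str:
    """Compare header, then file information, then hash value, stopping early."""
    if a.header != b.header:
        return "different (headers differ, S830)"
    if a.file_info != b.file_info:
        return "different (file information differs, S860)"
    if a.hash_value != b.hash_value:
        return "different (hash values differ, S880)"
    return "identical (S890)"
```

A header mismatch means the two pieces of hash data do not even share a layout, so the comparison can stop before either payload is read.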
Fig. 9 is a block diagram showing another embodiment of the hash comparison system according to the present invention. The hash comparison system shown in Fig. 9 can be applied where files are compared in a 1:N relationship. The system is configured to first compare only the file information, generate a first comparison group from the files that have the same file information, and then compare the hash values of only the files that belong to the first comparison group.
Referring to Fig. 9, hash comparison system 200 includes file information generation unit 210, hash generation unit 220, control module 250, and hash comparing unit 260. In an embodiment, hash comparison system 200 may further include at least one of hash file management unit 230 and source file management unit 240. In describing the embodiment shown in Fig. 9, descriptions of components that are the same as or similar to those of the embodiment of Fig. 4 are omitted or given only briefly.
Control module 250 can select, from object file group B, the files that are identical to source file A. To do so, control module 250 can select the hash data associated with all files included in object file group B, select the hash data of source file A, and compare the selected hash data. In the comparison, control module 250 can divide each piece of hash data into file information and a hash value and first compare only the file information. That is, control module 250 can compare the file information of source file A with the file information of each object file in object file group B, collect the object files that have the same file information, and thereby generate the first comparison group. Control module 250 can then use hash comparing unit 260 to compare the hash values of the object files in the first comparison group with the hash value of source file A and determine which files are identical.
Hash comparing unit 260 can compare only hash values under the control of control module 250. In the disclosed embodiment, hash comparing unit 260 is dedicated to comparing hash values alone, so the comparison can be performed more efficiently where a 1:N search is required.
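The 1:N search of Fig. 9 can be sketched as a filter followed by a hash comparison. The function name `find_identical` and the dictionary representation of object file group B are assumptions for illustration; each entry maps a file name to a (file information, hash value) pair.

```python
from typing import Dict, List, Tuple

Entry = Tuple[bytes, bytes]  # (file information, hash value)


def find_identical(source: Entry, group_b: Dict[str, Entry]) -> List[str]:
    """Return the names of files in group_b that are identical to the source."""
    src_info, src_hash = source
    # Stage 1: build the first comparison group from file information alone.
    first_group = [name for name, (info, _) in group_b.items() if info == src_info]
    # Stage 2: compare hash values only for members of the first comparison group.
    return [name for name in first_group if group_b[name][1] == src_hash]
```

For example, with `group_b = {"x": (b"i1", b"h1"), "y": (b"i1", b"h2"), "z": (b"i2", b"h1")}` and `source = (b"i1", b"h1")`, only `"x"` and `"y"` reach the hash comparison, and only `"x"` is reported.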
According to the technology disclosed in the present invention, whether files differ from each other can be determined before their hash values are compared, so the complete hash data of different files need not be compared, which provides the advantage that files can be compared more quickly.
A further advantage of the technology disclosed in the present invention is that the parity information, which consists of a parity bit for the file information and a parity bit for the hash value, can be used to check whether the file information and the hash value of each hash data structure are correctly formed.
Although preferred embodiments of the present invention have been disclosed for illustrative purposes, those skilled in the art will appreciate that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the appended claims.

Claims (14)

1. A hash data generation method for generating hash data to be used for comparing source files, the method comprising the steps of:
(a) checking attributes of each source file and generating, based on the checked attributes, file information consisting of predetermined data bits;
(b) calculating a hash value by applying a hashing algorithm to at least a part of the source file; and
(c) generating hash data by appending the hash value directly after the file information.
2. The hash data generation method according to claim 1, wherein step (a) comprises:
checking at least one of a size, a name, and a format of the source file, first part data comprising beginning data of the source file, and second part data comprising last data of the source file; and
generating the file information comprising at least one of the size, the name, and the format of the source file, the first part data comprising the beginning data of the source file, and the second part data comprising the last data of the source file.
3. The hash data generation method according to claim 1, further comprising step (d): generating a hash parity bit for the hash data.
4. The hash data generation method according to claim 3, wherein step (d) comprises:
generating a first parity bit for the file information;
generating a second parity bit for the hash value; and
generating the hash parity bit by concatenating the first parity bit and the second parity bit.
5. A hash data comparison method for comparing two source files with each other using hash data that comprises file information and a hash value, the method comprising the steps of:
(a) checking two pieces of hash data respectively associated with the two source files;
(b) comparing the two pieces of file information included in the two pieces of hash data with each other; and
(c) if the two pieces of file information are identical, comparing the two hash values included in the two pieces of hash data with each other, and, if the two hash values are identical, determining that the two source files are identical files.
6. The hash data comparison method according to claim 5, wherein the file information comprises at least one of a size, a name, and a format of the corresponding source file, first part data comprising beginning data of the source file, and second part data comprising last data of the source file.
7. The hash data comparison method according to claim 6, wherein step (b) comprises comparing the data bits that make up the two pieces of file information with each other.
8. The hash data comparison method according to claim 6, wherein step (b) comprises:
for each of the two pieces of file information, identifying at least one of the size, the name, and the format of the source file, the first part data comprising the beginning data of the source file, and the second part data comprising the last data of the source file included in that file information; and
comparing the two pieces of file information with each other with respect to the identified at least one of the size, the name, and the format of the source file, the first part data comprising the beginning data of the source file, and the second part data comprising the last data of the source file.
9. A hash data comparison system for comparing source files with each other using hash data that comprises file information and a hash value, the system comprising:
a file information generation unit configured to check attributes of each source file and generate file information about the source file;
a hash generation unit configured to calculate a hash value by applying a hash function algorithm to at least a part of the source file; and
a control module configured to generate, for each source file, hash data comprising the file information and the hash value.
10. The hash data comparison system according to claim 9, further comprising a hash file management unit configured to store the generated hash data and to maintain information about the source file associated with the stored hash value.
11. The hash data comparison system according to claim 9, wherein the control module determines whether a first source file and a second source file are identical by sequentially comparing the file information and the hash values of the first source file and the second source file.
12. The hash data comparison system according to claim 9, wherein the control module generates a structure header comprising identification information about the file information and the hash value, and generates the hash data comprising the structure header, the file information, and the hash value.
13. The hash data comparison system according to claim 12, wherein the control module sequentially compares the structure headers, the file information, and the hash values of a first source file and a second source file and, if the structure headers, the file information, and the hash values of the first source file and the second source file are identical, determines that the first source file and the second source file are identical files.
14. The hash data comparison system according to claim 9, wherein the control module generates a parity bit for the hash data, the parity bit comprising parity bits calculated respectively for the file information and the hash value.
CN2012103330235A 2011-10-28 2012-09-10 Hash data structure used for file comparison,hash comparison system and method Pending CN102945241A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2011-0111296 2011-10-28
KR1020110111296A KR101310253B1 (en) 2011-10-28 2011-10-28 Hash data creation method and hash data comparison system and method

Publications (1)

Publication Number Publication Date
CN102945241A true CN102945241A (en) 2013-02-27

Family

ID=47728187

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103330235A Pending CN102945241A (en) 2011-10-28 2012-09-10 Hash data structure used for file comparison,hash comparison system and method

Country Status (4)

Country Link
KR (1) KR101310253B1 (en)
CN (1) CN102945241A (en)
TW (1) TW201319929A (en)
WO (1) WO2013062223A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699610A (en) * 2013-12-13 2014-04-02 乐视网信息技术(北京)股份有限公司 Method for generating file verification information, file verifying method and file verifying equipment
WO2014206223A1 (en) * 2013-06-27 2014-12-31 华为终端有限公司 Method, server, and client for securely accessing web application
US20170017798A1 (en) * 2015-07-17 2017-01-19 International Business Machines Corporation Source authentication of a software product
CN106471767A (en) * 2014-07-04 2017-03-01 国立大学法人名古屋大学 Communication system and key information sharing method
CN107133120A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of method of calibration of file data, device
CN110197005A (en) * 2019-05-07 2019-09-03 珠海格力电器股份有限公司 Automatic identification method and device for CAE model of air conditioner
CN110990897A (en) * 2019-12-16 2020-04-10 北京无忧创想信息技术有限公司 File fingerprint generation method and device

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015060494A1 (en) * 2013-10-21 2015-04-30 주식회사 리얼타임테크 Apparatus for automatically updating record id of navigation network data and method for same
US9811333B2 (en) 2015-06-23 2017-11-07 Microsoft Technology Licensing, Llc Using a version-specific resource catalog for resource management
KR20220041394A (en) * 2020-09-25 2022-04-01 삼성전자주식회사 Electronic device and method for managing non-destructive editing contents

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091497A1 (en) * 2002-07-01 2005-04-28 Canon Kabushiki Kaisha Imaging apparatus
CN101354708A (en) * 2008-07-29 2009-01-28 四川大学 Remote file rapid synchronization method
CN101582076A (en) * 2009-06-24 2009-11-18 浪潮电子信息产业股份有限公司 Data de-duplication method based on data base

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4049498B2 (en) * 1999-11-18 2008-02-20 株式会社リコー Originality assurance electronic storage method, apparatus, and computer-readable recording medium
JP2000357115A (en) 1999-06-15 2000-12-26 Nec Corp Device and method for file retrieval
JP2006053836A (en) 2004-08-13 2006-02-23 Fuji Electric Systems Co Ltd Authenticity determination apparatus, and system for storing and utilizing electronic file
US20110145259A1 (en) 2009-12-11 2011-06-16 Pitney Bowes Inc. System and method for identifying data fields for remote address cleansing

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014206223A1 (en) * 2013-06-27 2014-12-31 华为终端有限公司 Method, server, and client for securely accessing web application
US9830454B2 (en) 2013-06-27 2017-11-28 Huawei Device (Dongguan) Co., Ltd. Web application security access method, server, and client
CN103699610A (en) * 2013-12-13 2014-04-02 乐视网信息技术(北京)股份有限公司 Method for generating file verification information, file verifying method and file verifying equipment
CN106471767A (en) * 2014-07-04 2017-03-01 国立大学法人名古屋大学 Communication system and key information sharing method
CN106471767B (en) * 2014-07-04 2019-12-24 国立大学法人名古屋大学 Communication system and key information sharing method
US20170017798A1 (en) * 2015-07-17 2017-01-19 International Business Machines Corporation Source authentication of a software product
US9965639B2 (en) * 2015-07-17 2018-05-08 International Business Machines Corporation Source authentication of a software product
US20180225470A1 (en) * 2015-07-17 2018-08-09 International Business Machines Corporation Source authentication of a software product
US10558816B2 (en) * 2015-07-17 2020-02-11 International Business Machines Corporation Source authentication of a software product
CN107133120A (en) * 2016-02-29 2017-09-05 阿里巴巴集团控股有限公司 A kind of method of calibration of file data, device
CN110197005A (en) * 2019-05-07 2019-09-03 珠海格力电器股份有限公司 Automatic identification method and device for CAE model of air conditioner
CN110990897A (en) * 2019-12-16 2020-04-10 北京无忧创想信息技术有限公司 File fingerprint generation method and device

Also Published As

Publication number Publication date
KR101310253B1 (en) 2013-09-24
WO2013062223A1 (en) 2013-05-02
KR20130046746A (en) 2013-05-08
TW201319929A (en) 2013-05-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130227