Summary of the invention
The invention provides a metadata-based method for removing duplicate objects that can accurately identify duplicate data and remove it.
The invention adopts the following technical solution: a metadata-based method for removing duplicate objects, comprising the following steps:
1) standardize the metadata currently awaiting entry and determine whether it qualifies as good-quality metadata to be entered;
2) compare the good-quality metadata to be entered with each record in the data set and determine whether the data set contains a record that duplicates it;
3) if a duplicate record exists, keep whichever of the two has the better quality as the record in the data set.
The metadata to be entered comprises at least the following fields: International Standard Book Number (ISBN), title, author, publisher, publication date, and price.
The ISBN consists of 10 digits made up of four parts: group number, publisher number, title number, and check digit, joined by "-"; the publisher number is the code of the publishing house.
Said "standardizing the metadata to be entered" comprises the following steps:
1) determine whether the ISBN of the metadata to be entered contains non-digit characters; if so, delete the non-digit characters and keep the metadata;
2) determine whether the ISBN of the metadata to be entered consists of 10 digits; if not, two cases are handled: if the ISBN has fewer than 8 digits, discard the metadata; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the metadata;
3) verify whether the ISBN of the metadata to be entered is correct;
4) if the ISBN of the metadata to be entered is correct, verify whether its publisher is correct;
If the publisher of the metadata to be entered is correct, the metadata is said "good-quality metadata to be entered".
Said "verifying whether the ISBN of the metadata to be entered is correct" is performed as follows: the 1st to 9th digits of the ISBN are multiplied in order by the nine numbers 10 down to 2, and the check digit is added to the sum of these products; if the result is divisible by 11, the ISBN is correct;
Said "verifying whether the publisher of the metadata to be entered is correct" is performed as follows:
The publisher number is taken from the standardized ISBN and used to verify whether the publisher of the metadata to be entered is correct;
If the publisher number corresponds to the publisher of the metadata to be entered, the publisher of the metadata is correct;
If the publisher number does not correspond to the publisher of the metadata to be entered, the publisher of the metadata is incorrect.
Said "standardizing the metadata to be entered" further comprises: standardizing the publication date and the price as real numbers.
When the data set is empty, said steps 2) and 3) are specifically:
2) the data set contains no record that duplicates the metadata to be entered;
3) the good-quality metadata to be entered is stored in the data set.
When the data set is not empty, said step 2) comprises:
21) narrow the range of records in the data set that are compared with the metadata to be entered;
22) within the range limited in step 21), use the weighted similarity comparison function to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set record;
23) multiply each field's similarity value by its weight and add the products to obtain the compound similarity value;
24) compare the compound similarity value with a preset threshold; if the compound similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if the compound similarity value is less than the threshold, the current record in the data set and the metadata to be entered are not duplicates.
Said step 21) is specifically:
211) among the records of the data set, select the records whose publisher field is identical to that of the metadata to be entered, as the comparison range;
212) within the selected records, select the ISBN, title, author, publisher, publication date, and price fields, as the comparison range.
The weighted similarity comparison function in step 22) comprises: an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.
The invention standardizes the metadata to be entered (dirty data) so that it is free of obvious formal errors; at that point the metadata is of reasonably good quality. The good-quality metadata to be entered is compared with each record in the data set to determine whether the data set contains a record that duplicates it. During comparison, narrowing the comparison range reduces the workload and improves efficiency: among the thousands upon thousands of records in the data set, the records whose publisher field is identical to that of the metadata to be entered are selected as the comparison range, and within the selected records the ISBN, title, author, publisher, publication date, and price fields are selected as the comparison range. The similarity comparison functions are used to calculate the similarity values between a record in the data set and the metadata to be entered, and a weight training function is used to calculate the field weights. Each field's similarity value is multiplied by its weight and the products are added to obtain the compound similarity value, which is compared with a preset threshold. If the compound similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if it is less than the threshold, they are not duplicates.
Embodiment
To address the heavy workload of removing dirty data in the existing field of metadata cleaning, the invention provides a metadata-based method for removing duplicate objects. With reference to Fig. 1, it comprises the following steps:
1) standardize the metadata currently awaiting entry and determine whether it qualifies as good-quality metadata to be entered;
2) compare the good-quality metadata to be entered with each record in the data set and determine whether the data set contains a record that duplicates it;
3) if a duplicate record exists, keep whichever of the two has the better quality as the record in the data set.
The information for a book listed online includes a large amount of metadata, most of which is dirty data, that is, data of poor quality. For example: title: The Romance of the Three Kingdoms; ISBN: ISBN7-305-01568-7; publisher number: 305; publisher: Hundred Flowers Publishing House; publication date: June 9, 1988; language: Chinese; place of publication: Nanjing; author: Luo Guanzhong; responsible editor: Cao Xueqin; current price: 109,90 yuan; edition: 1st edition, September 1996; 3rd printing, May 1988; and so on. In the metadata above, the part before each colon is a field and the part after it is an attribute value. Together this information forms one record in the data set. In this record all attribute values are correct, and it is called good-quality data. In practice, the attribute values in metadata records are often wrong. Taking the same record for The Romance of the Three Kingdoms as an example: title: The Romance of the Three Kingdoms; ISBN: ISBN8-305-01548-7; publisher number: 306; publisher: Flowers Hundred Publishing House; publication date: February 30, 1988; language: Chinese; place of publication: Nanjing; author: Luo Guanzhong; responsible editor: Cao Xueqin; current price: 109,908 yuan; edition: 1st edition, September 1996; 3rd printing, May 1988; and so on. In this record the attribute values of the ISBN, publisher number, publisher, publication date, responsible editor, current price, and other fields all contain errors. Such data is called dirty data, or poor-quality data.
Good-quality metadata should be entered into the data set, and poor-quality metadata should be cleaned out. At present, the quality of the metadata to be entered is always judged manually at entry time, which is inefficient and applies no uniform standard.
I. Poor-quality metadata must first be standardized before entry:
1) Standardizing the ISBN:
The copyright page of every regular publication generally carries an ISBN; ISBN is the abbreviation of International Standard Book Number. It consists of 10 digits made up of four parts: group number, publisher number, title number, and check digit, joined by "-", for example ISBN7-305-01568-7. The group number identifies the country or language area; China's is 7. The publisher number is the code of the publishing house; it is set and assigned by the national ISBN center and may be 1 to 7 digits long. The title number is the number the publisher gives to each of its publications. The check digit is the last digit of the ISBN and can be used to check whether the ISBN is correct: the 1st to 9th digits of the ISBN are multiplied in order by the nine numbers 10 down to 2, and the check digit is added to the sum of these products; if the result is divisible by 11, the ISBN is correct.
Steps 1 and 2 below verify the formal correctness of the ISBN. Every ISBN must meet these formal requirements before the correctness of the number itself is verified:
1. Determine whether the ISBN of the metadata to be entered contains non-digit characters; if so, delete the non-digit characters and keep the metadata;
2. Determine whether the ISBN of the metadata to be entered consists of 10 digits; if not, two cases are handled: if the ISBN has fewer than 8 digits, discard the metadata; if the ISBN has more than 10 digits, delete the digits after the 10th and keep the metadata;
3. Multiply the 1st to 9th digits of the ISBN in order by the nine numbers 10 down to 2 and add the check digit to the sum of these products; if the result is divisible by 11, the ISBN is correct. Taking the same record for The Romance of the Three Kingdoms as an example: for ISBN7-305-01568-7 the calculation is 7*10+3*9+0*8+5*7+0*6+1*5+5*4+6*3+8*2+7=198, and 198/11=18, so the result is divisible by 11 and the ISBN is correct. For ISBN8-305-01548-7 the calculation is 8*10+3*9+0*8+5*7+0*6+1*5+5*4+4*3+8*2+7=202, and 202/11=18 remainder 4, so the result is not divisible by 11 and the ISBN is incorrect.
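The three ISBN steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `normalize_isbn` and `isbn10_checksum_ok` are hypothetical, and because the text does not say what happens to a number with 8 or 9 digits, the sketch simply lets such a number fail the checksum. Following the text, every non-digit character is deleted, so a check character other than a digit is not handled.

```python
import re
from typing import Optional


def normalize_isbn(raw: str) -> Optional[str]:
    """Apply formal steps 1 and 2; return the digits kept, or None if discarded."""
    digits = re.sub(r"[^0-9]", "", raw)  # step 1: delete non-digit characters
    if len(digits) < 8:                  # step 2: fewer than 8 digits, discard
        return None
    return digits[:10]                   # step 2: drop digits after the 10th


def isbn10_checksum_ok(digits: str) -> bool:
    """Step 3: weights 10 down to 2 on digits 1-9, plus the check digit, mod 11."""
    if len(digits) != 10:
        return False                     # assumption: 8- or 9-digit numbers fail
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    return (total + int(digits[9])) % 11 == 0


good = normalize_isbn("ISBN7-305-01568-7")
bad = normalize_isbn("ISBN8-305-01548-7")
print(good, isbn10_checksum_ok(good))
print(bad, isbn10_checksum_ok(bad))
```

The two calls at the end reuse the two example numbers from the worked calculation, so the results can be compared with the sums given there.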
2) Standardizing the publisher:
1. Determine whether the ISBN of the metadata to be entered is in string form; if it contains characters of any other type, delete them and keep the metadata;
2. Take the publisher number from the standardized ISBN and use it to verify whether the publisher of the metadata to be entered is correct;
If the publisher number corresponds to the publisher of the metadata to be entered, the publisher of the metadata is correct;
If the publisher number does not correspond to the publisher of the metadata to be entered, the publisher of the metadata is incorrect.
The publisher number is the code of the publishing house; it is set and assigned by the national ISBN center and may be 1 to 7 digits long. For example, from ISBN7-305-01568-7 the publisher number 305 is extracted, and the corresponding publisher is found to be Hundred Flowers Publishing House. If the publisher in the metadata to be entered is Hundred Flowers Publishing House, the publisher of the metadata is correct.
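The publisher check can be illustrated with a short Python sketch. It rests on assumptions that are not in the text: the hyphenated form of the ISBN is still available (so the publisher number is simply the second hyphen-separated part), and the correspondence between publisher numbers and publisher names is held in a lookup table. The table `PUBLISHER_BY_NUMBER`, its single entry, and the function names are hypothetical placeholders.

```python
from typing import Dict

# Hypothetical lookup table; real data would come from the national ISBN center.
PUBLISHER_BY_NUMBER: Dict[str, str] = {
    "305": "Hundred Flowers Publishing House",
}


def publisher_number(hyphenated_isbn: str) -> str:
    """Return the second hyphen-separated part, e.g. '305' for 'ISBN7-305-01568-7'."""
    text = hyphenated_isbn.strip()
    if text.upper().startswith("ISBN"):
        text = text[4:]
    parts = text.strip().split("-")
    if len(parts) != 4:
        raise ValueError("expected group-publisher-title-check")
    return parts[1]


def publisher_is_correct(hyphenated_isbn: str, publisher: str) -> bool:
    """True when the publisher number maps to the publisher named in the metadata."""
    expected = PUBLISHER_BY_NUMBER.get(publisher_number(hyphenated_isbn))
    return expected is not None and expected == publisher.strip()


print(publisher_is_correct("ISBN7-305-01568-7", "Hundred Flowers Publishing House"))
```

A table lookup is only one way to represent the "corresponding relation"; any mapping from publisher number to publisher name would serve.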
3) Standardize the title and the author as strings. If digits or characters of other types appear in them, remove those characters and keep the metadata. For example, if the metadata to be entered has the author "Luo 9 Guanzhong" or "Luo Guan, zhong", the "9" and the "," are deleted during standardization, and the author "Luo Guanzhong" is kept for later processing.
4) Standardize the publication date and the price as real numbers. If Chinese characters or characters of other types appear in them, remove those characters and keep the metadata. For example, if the metadata to be entered has the publication date 1988-6f-9 or 198水8-6-9 (where 水 is a Chinese character), the "f" and the "水" are removed during standardization, and the publication date 1988-6-9 is kept for later processing.
5) Standardize the responsible editor, current price, edition, synopsis, classification, subject terms, and other fields.
After standardization, the dirty data is free of obvious formal errors, and the metadata is of reasonably good quality.
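The field clean-up in items 3) and 4) can be sketched as below. The sketch is an assumption-laden illustration: the function names are hypothetical, "characters of other types" is interpreted as anything that is not a letter or a space for text fields, and for date and price fields the digits plus the separators "-" and "." are kept, since the example keeps 1988-6-9 with its hyphens.

```python
import re


def clean_text_field(value: str) -> str:
    """Title or author: drop digits and punctuation, keep letters and spaces."""
    kept = "".join(ch for ch in value if ch.isalpha() or ch.isspace())
    return " ".join(kept.split())  # collapse runs of whitespace


def clean_numeric_field(value: str) -> str:
    """Publication date or price: keep ASCII digits and the separators '-' and '.'."""
    return re.sub(r"[^0-9.\-]", "", value)


print(clean_numeric_field("1988-6f-9"))
print(clean_numeric_field("198水8-6-9"))
print(clean_text_field("Luo9 Guanzhong"))
```

`str.isalpha` is true for Chinese characters as well as Latin letters, so the same text clean-up applies to names written in either script.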
II. Compare the good-quality metadata to be entered with each record in the data set and determine whether the data set contains a record that duplicates it.
Two cases are distinguished according to how many records the data set holds: 1) the data set is empty; and 2) the data set is not empty.
1) When the data set is empty, the metadata to be entered is entered directly into the data set;
2) When the data set is not empty, it already holds some records; with reference to Fig. 2, entry proceeds in the following steps:
a) Narrow the range of records in the data set that are compared with the metadata to be entered;
The data set holds thousands upon thousands of records for books of all periods. When a piece of metadata is to be entered, the data set must be searched for a record that duplicates it. To reduce the workload and improve efficiency, the range of records compared with the metadata to be entered must be narrowed. The specific measures are:
a1. Among the records of the data set, select the records whose publisher field is identical to that of the metadata to be entered, as the comparison range;
Many of the records in the data set come from the same publisher. For comparison, the records whose publisher field is identical to that of the metadata to be entered are extracted as the comparison range.
For example, when the metadata for The Romance of the Three Kingdoms is to be entered into the data set, its publisher is Hundred Flowers Publishing House. The records in the data set whose publisher field has the value Hundred Flowers Publishing House are extracted as the comparison range.
a2. Within the selected records, select the ISBN, title, author, publisher, publication date, and price fields, as the comparison range.
To reduce the workload and improve efficiency further, the comparison range is narrowed again within the records that share the same publisher: only the ISBN, title, author, publisher, publication date, and price fields are selected as the comparison range.
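The two narrowing measures a1 and a2 can be sketched as a filter followed by a projection. The record layout (one dictionary per record), the field keys in `COMPARE_FIELDS`, and the function name are assumptions made for this illustration only.

```python
from typing import Dict, List

# Hypothetical field keys for the six fields named in measure a2.
COMPARE_FIELDS = ("isbn", "title", "author", "publisher", "publication_date", "price")


def candidate_records(incoming: Dict[str, str],
                      data_set: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """a1: keep records with the same publisher; a2: keep only the six fields."""
    return [
        {field: record.get(field, "") for field in COMPARE_FIELDS}
        for record in data_set
        if record.get("publisher") == incoming.get("publisher")
    ]


data_set = [
    {"isbn": "7305015687", "title": "The Romance of the Three Kingdoms",
     "publisher": "Hundred Flowers Publishing House", "language": "Chinese"},
    {"isbn": "0000000000", "title": "Another Book",
     "publisher": "Some Other Press", "language": "Chinese"},
]
incoming = {"title": "The Romance of the Three Kingdoms",
            "publisher": "Hundred Flowers Publishing House"}
print(candidate_records(incoming, data_set))
```

Only the first record survives the filter, and its `language` field is dropped because it is not one of the six compared fields.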
b) Within the range limited in step a), use the weighted similarity comparison function

$$f(r_1, r_2) = f'(r'_1, r'_2) - \alpha \left(1 - f'(r''_1, r''_2)\right), \qquad f' \in [0, 1]$$

to calculate the similarity value between the attribute values of corresponding fields in the metadata to be entered and in the data set record. Here $f'$ is a similarity comparison function of the prior art; $r_1, r_2$ are the attribute values of the corresponding field (ISBN, title, author, publisher, publication date, or price) in the metadata to be entered and in the data set record; $r'_1, r'_2$ are what remains of the attribute values after the ignore words are removed; $r''_1, r''_2$ are the attribute values with only the weight words kept; and $\alpha$ is a weight obtained by training with a training algorithm. When there are no weight words, $f(r_1, r_2) = f'(r'_1, r'_2)$. For example, in the publisher field, for the attribute values "Tsinghua University Press" and "Beijing University Press", the words "University Press" can be treated as ignore words because they contribute little to comparing this field; only "Tsinghua" and "Beijing" are compared, and these are $r'_1, r'_2$. In the title field, for the attribute value "The Romance of the Three Kingdoms (upper and lower volumes)", "upper and lower volumes" is the weight word, that is, $r''_1, r''_2$.
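A minimal Python sketch of the weighted similarity comparison function follows. The text leaves the prior-art function f' open, so the standard-library `difflib.SequenceMatcher` ratio is swapped in as a stand-in with values in [0, 1]. The helper names, the way ignore words and weight words are passed in as tuples, and the value chosen for α are all assumptions for illustration; in the method itself α is obtained by training.

```python
from difflib import SequenceMatcher
from typing import Sequence


def base_similarity(a: str, b: str) -> float:
    """Stand-in for f': a generic string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def strip_words(value: str, words: Sequence[str]) -> str:
    """r': the attribute value with the ignore words removed."""
    for word in words:
        value = value.replace(word, "")
    return " ".join(value.split())


def keep_words(value: str, words: Sequence[str]) -> str:
    """r'': only the weight words that occur in the attribute value."""
    return " ".join(word for word in words if word in value)


def weighted_similarity(r1: str, r2: str, alpha: float,
                        ignore_words: Sequence[str] = (),
                        weight_words: Sequence[str] = ()) -> float:
    """f(r1, r2) = f'(r1', r2') - alpha * (1 - f'(r1'', r2''))."""
    core = base_similarity(strip_words(r1, ignore_words),
                           strip_words(r2, ignore_words))
    w1, w2 = keep_words(r1, weight_words), keep_words(r2, weight_words)
    if not w1 and not w2:  # no weight words: f reduces to f'(r1', r2')
        return core
    return core - alpha * (1.0 - base_similarity(w1, w2))


print(weighted_similarity("Tsinghua University Press", "Beijing University Press",
                          alpha=0.3, ignore_words=("University Press",)))
```

With the ignore words stripped, the call compares only "Tsinghua" with "Beijing", mirroring the publisher example; when two titles differ only in a weight word such as a volume marker, the second term lowers the score.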
The similarity comparison functions comprise: an integer similarity comparison function, a string similarity comparison function, and a real-number similarity comparison function.
The comparison function for the ISBN field returns 1 if the ISBNs are equal and 0 otherwise;
The comparison function for the title field is the string similarity value of the words obtained by word segmentation;
The comparison function for the author field is the string similarity value of the words obtained by word segmentation;
The comparison function for the publication date uses a relative difference function to obtain the similarity value;
The comparison function for the price uses a relative difference function to obtain the similarity value.
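The per-field comparison functions listed above might look as follows in Python. Several choices here are assumptions, since the text does not define them: whitespace splitting stands in for word segmentation, `difflib.SequenceMatcher` over the word lists stands in for the string similarity, and the relative difference function is taken to be 1 - |a - b| / max(|a|, |b|). A publication date would first have to be converted to a number, which this sketch does not show.

```python
from difflib import SequenceMatcher


def isbn_similarity(a: str, b: str) -> float:
    """ISBN field: 1 if the two numbers are equal, otherwise 0."""
    return 1.0 if a == b else 0.0


def word_similarity(a: str, b: str) -> float:
    """Title or author field: similarity of the segmented word sequences."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def relative_difference_similarity(a: float, b: float) -> float:
    """Date or price field: assumed form 1 - |a - b| / max(|a|, |b|)."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return 1.0  # both values are zero, so they are identical
    return max(0.0, 1.0 - abs(a - b) / largest)


print(isbn_similarity("7305015687", "7305015687"))
print(word_similarity("Romance of the Three Kingdoms", "Romance of Three Kingdoms"))
print(relative_difference_similarity(100.0, 90.0))
```

All three functions return values in [0, 1], so they can serve as f' in the weighted similarity comparison function.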
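c) Use the compound similarity function to calculate the compound similarity value of the metadata to be entered. Since each field's similarity value is multiplied by its weight and the products are added, the compound similarity value is the weighted sum $\sum_i \alpha_i f_i(R_1, R_2)$, where $\alpha_0$ is the threshold with which this sum is compared, $\alpha_i$ is the weight of field $i$, $R_1, R_2$ are the metadata records, and $f_i(R_1, R_2)$ is the weighted similarity comparison function for field $i$ of $R_1$ and $R_2$;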
d) Compare the compound similarity value with a preset threshold; if the compound similarity value is not less than the threshold, the current record in the data set and the metadata to be entered are duplicates; if the compound similarity value is less than the threshold, the current record in the data set and the metadata to be entered are not duplicates.
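Steps c) and d) amount to a weighted sum followed by a threshold test, which can be sketched as below. The function names, the field keys, and the numeric weights and threshold are hypothetical; in the method the weights come from training and the threshold is preset.

```python
from typing import Dict


def compound_similarity(field_scores: Dict[str, float],
                        weights: Dict[str, float]) -> float:
    """Step c): multiply each field's similarity by its weight and add the products."""
    return sum(weights[name] * score for name, score in field_scores.items())


def is_duplicate(field_scores: Dict[str, float],
                 weights: Dict[str, float],
                 threshold: float) -> bool:
    """Step d): duplicate when the compound similarity is not less than the threshold."""
    return compound_similarity(field_scores, weights) >= threshold


scores = {"isbn": 1.0, "title": 0.5}    # illustrative per-field similarity values
weights = {"isbn": 0.5, "title": 0.5}   # illustrative weights, not trained values
print(compound_similarity(scores, weights))
print(is_duplicate(scores, weights, threshold=0.75))
```

The comparison uses `>=` because the text treats a compound similarity value equal to the threshold as a duplicate.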