Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a method and a device for determining the resource identification information of a file package, so that obtaining a file package resource identification requires simpler operations, occupies fewer resources, and costs less to develop, while effectively ensuring that the identification obtained is representative and accurate.
Another object of the embodiments of the present invention is to provide a method for generating a file package resource database and a method for searching file package resources, so as to effectively improve the efficiency and accuracy of file package resource searches.
To solve the above technical problem, an embodiment of the present invention provides a method for determining the resource identification information of a file package, comprising:
obtaining attribute information of candidate objects, where the candidate objects are the files and/or folders contained in a file package;
if the attribute information of one of the candidate objects satisfies a preset condition, determining that the name information of the candidate object whose attribute information satisfies the preset condition is the resource identification information of the file package.
Preferably, the files and/or folders contained in the file package are:
the files and/or folders contained in the root directory layer of the file package.
Preferably, the attribute information is file size information.
Preferably, the preset condition is:
the size of the file or folder is the largest in the file package; or,
the ratio of the size of the file or folder to the size of the file package is greater than or equal to a threshold.
Preferably, the file package is a compressed file package, and the file size information is the size of the file or folder before compression.
Preferably, the attribute information is name information.
Preferably, the preset condition is:
the name information of the file or folder does not match a preset invalid keyword; and/or,
the name information of the file or folder matches a preset valid keyword.
Preferably, the candidate objects further include the file package itself, and the preset condition is:
the name information of the file package, or the name information of a file or folder in the file package, does not match a preset invalid keyword; and/or,
the name information of the file package, or the name information of a file or folder in the file package, matches a preset valid keyword.
Preferably, the preset condition further comprises:
weighting the name information that matches a preset valid keyword, and extracting the name information with the highest weight.
Preferably, the method further comprises:
replacing the name information of the file package with the resource identification information.
An embodiment of the present invention also discloses a method for generating a file package resource database, comprising:
obtaining the URL of a file package in a web page and the content sig ID of that file package, where the content sig ID is obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm is one that yields different results for the content data of different binary files;
obtaining the resource identification information of the file package, where the resource identification information is the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects include the files and/or folders contained in the file package;
recording the content sig ID together with the corresponding file package URL and file package resource identification information to form a database.
An embodiment of the present invention also discloses a method for searching file package resources, comprising:
initializing a database, where the database contains content sig IDs and the corresponding file package URLs and file package resource identification information; each file package URL is obtained by crawling file package resources in web pages; each content sig ID is obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm is one that yields different results for the content data of different binary files; the file package resource identification information is the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects include the files and/or folders contained in the file package;
matching the corresponding file package resource identification information in the database according to a search keyword entered by the user;
looking up the corresponding file package URL through the matched file package resource identification information; or looking up the corresponding content sig ID and obtaining the corresponding file package URL according to that content sig ID.
Preferably, the method further comprises:
downloading the data of the file package resource from the URL.
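By way of non-limiting illustration, the database generation and search methods above might be sketched as follows. The sketch assumes SHA-1 as the predetermined algorithm (the specification only requires that different binary content yield different results), keeps the records in an in-memory dictionary rather than a real database, and uses hypothetical names (content_sig_id, PackageDatabase) and hypothetical example.com URLs.

```python
import hashlib


def content_sig_id(data: bytes) -> str:
    """Assumed 'predetermined algorithm': SHA-1 over the package content."""
    return hashlib.sha1(data).hexdigest()


class PackageDatabase:
    """Hypothetical in-memory store: content sig ID -> URLs and identification."""

    def __init__(self):
        self.records = {}

    def add(self, url, data, identification):
        sig = content_sig_id(data)
        # Packages with identical content share one record, so the
        # identification only has to be determined once per content sig ID.
        record = self.records.setdefault(
            sig, {"urls": [], "identification": identification}
        )
        if url not in record["urls"]:
            record["urls"].append(url)
        return sig

    def search(self, keyword):
        """Return the URLs of packages whose identification contains the keyword."""
        keyword = keyword.lower()
        return [
            url
            for record in self.records.values()
            if keyword in record["identification"].lower()
            for url in record["urls"]
        ]


db = PackageDatabase()
payload = b"identical package content"
db.add("http://example.com/a/test.zip", payload, "Foundation of Software Engineering")
db.add("http://example.com/b/123.rar", payload, "Foundation of Software Engineering")
urls = db.search("software")
```

Because both hypothetical URLs carry the same content, they are recorded under a single content sig ID, and a keyword search on the identification returns both download locations.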
An embodiment of the present invention also discloses a device for determining the resource identification information of a file package, comprising:
a file information acquiring unit, configured to obtain attribute information of candidate objects, where the candidate objects are the files and/or folders contained in a file package;
a judging unit, configured to judge whether the attribute information of the candidate objects satisfies a preset condition;
a determining unit, configured to determine, when the attribute information of one of the candidate objects satisfies the preset condition, that the name information of the candidate object satisfying the preset condition is the resource identification information of the file package.
Preferably, the files and/or folders contained in the file package are:
the files and/or folders contained in the root directory layer of the file package.
Preferably, the attribute information is file size information.
Preferably, the preset condition is:
the size of the file or folder is the largest in the file package; or,
the ratio of the size of the file or folder to the size of the file package is greater than or equal to a threshold.
Preferably, the file package is a compressed file package, and the file size information is the size of the file or folder before compression.
Preferably, the attribute information is name information.
Preferably, the preset condition is:
the name information of the file or folder does not match a preset invalid keyword; and/or,
the name information of the file or folder matches a preset valid keyword.
Preferably, the candidate objects further include the file package itself, and the preset condition is:
the name information of the file package, or the name information of a file or folder in the file package, does not match a preset invalid keyword; and/or,
the name information of the file package, or the name information of a file or folder in the file package, matches a preset valid keyword.
Preferably, the preset condition further comprises:
weighting the name information that matches a preset valid keyword, and extracting the name information with the highest weight.
Preferably, the device further comprises:
a renaming unit, configured to replace the name information of the file package with the resource identification information.
An embodiment of the present invention also discloses a device for generating a file package resource database, comprising:
a resource URL crawling unit, configured to obtain the URL of a file package in a web page;
a content sig ID computing unit, configured to obtain the content sig ID of the file package by applying a predetermined algorithm to the content data of the file package, where the predetermined algorithm is one that yields different results for the content data of different binary files;
a resource identification information acquiring unit, configured to obtain the identification information of the file package resource, where the identification information is the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects include the files and/or folders contained in the file package;
a recording unit, configured to record the content sig ID together with the corresponding file package resource URL and file package resource identification information to form a database.
An embodiment of the present invention also discloses a device for searching file package resources, comprising:
a database, where the database contains content sig IDs and the corresponding file package resource URLs and file package resource identification information; each file package resource URL is obtained by crawling file package resources in web pages; each content sig ID is obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm is one that yields different results for the content data of different binary files; the file package resource identification information is the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects include the files and/or folders contained in the file package;
a matching unit, configured to match the corresponding file package resource identification information in the database according to a search keyword entered by the user;
a lookup unit, configured to look up the corresponding file package URL through the matched file package resource identification information; or to look up the corresponding content sig ID and obtain the corresponding file package URL according to that content sig ID.
Preferably, the device further comprises:
a download unit, configured to download the data of the file package resource from the URL.
The embodiments of the present invention have the following advantages:
First, the embodiments of the present invention select, from the files or folders contained in a file package, one file or folder whose attribute information satisfies a preset condition, and use its name information as the resource identification information of the file package. Because a file package is made up of files and folders, using the name of one of them as the identification information of the file package identifies the content of the file package more accurately.
The embodiments of the present invention can also compare the sizes of the files and folders in the root directory layer of the file package and select the name information of the largest file or folder, or of a file or folder whose size satisfies a certain threshold, as the resource identification information of the current file package. That is, the name of the file or folder that carries the greatest weight in the whole file package resource is chosen as the identification information, which yields a more representative file package resource identification.
The embodiments of the present invention can further select, according to preset rules, a more accurate name from among the name information of the largest file or folder (or of the file or folder whose size satisfies the threshold) and the name information of the current file package, and use it as the identification information of the file package, thereby further improving the accuracy with which file package resource identification information is determined.
Moreover, the embodiments of the present invention can generate a database by recording the file package resource identification information; in practice, this database can be used to search for and download file package resources. Because the file package resource identification information is more accurate, the accuracy of search results improves. Because the identification information is recorded against the content sig ID of the file package, and the content sig ID avoids repeated processing of file packages with identical content, search efficiency also improves.
Finally, for a service provider, the present invention is simple to implement, involves no technical barrier and no special proprietary algorithm, and carries low cost and risk.
Detailed description of the embodiments
To make the above objects, features, and advantages of the present invention easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention can be used in numerous general-purpose or special-purpose computing system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, distributed computing environments that include any of the above systems or devices, and so on.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The present invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
Referring to Figure 1, a flow chart of method embodiment one for determining the resource identification information of a file package is shown, which may comprise the following steps:
Step 101: obtain attribute information of candidate objects, where the candidate objects include the files and/or folders contained in a file package;
Step 102: if the attribute information of one of the candidate objects satisfies a preset condition, determine that the name information of the candidate object satisfying the preset condition is the resource identification information of the file package.
Preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package (that is, the first-level directory in the file package). Because the names of files or folders in the root directory layer generally represent the content of the file package better, and because fewer objects need to be examined, this improves the efficiency of determining the file package resource identification information. In this case, the preset condition only needs to be checked against the attribute information of the files and/or folders in the root directory layer, and the name information of a root-level file or folder that satisfies the preset condition is determined to be the resource identification information of the file package.
In practice, each file package has a corresponding file list that records information about all the files and folders in the package. The form in which the file list is stored differs between types of file package, and each compression type has its own protocol. Therefore, by following the protocol of the relevant compression type, the file header of a file package can be fetched with a web crawler (spider) or another tool and then parsed to obtain the corresponding file list.
Specifically, the file list is obtained as follows: the spider program crawls automatically along the links between pages on the Internet; whenever it finds a file package of a preset type (such as a compressed file package), it fetches it; after fetching, it parses the file header of the current file package to obtain the corresponding file list.
A file list generally includes: name information (file name), directory information (usually embodied in the file name), file size information, size after compression, file type, modification time, comment information, and so on. For example, the file header information of a zip-format file package is shown in the following table:
Field | Size
Compression method (compression algorithm) | 2 bytes
CRC-32 (cyclic redundancy check) | 4 bytes
Compressed size (size after compression) | 4 bytes
Uncompressed size (original size) | 4 bytes
Filename length (length of the file name) | 2 bytes
Filename (file name) | variable size
Here, filename is the name information of the file, and uncompressed size is its original size before compression. In practice, after the spider has fetched the file header information, it can also extract the filename and uncompressed size fields and store them in a database.
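As a non-limiting illustration of reading these fields, the following sketch uses Python's standard zipfile module to list each entry's name and uncompressed size. It is an assumption-laden stand-in for the spider described above: zipfile reads the archive's central directory (which carries the same filename and size fields) rather than fetching headers over a network, and the demo archive and the helper names list_entries and make_demo_archive are hypothetical.

```python
import io
import zipfile


def list_entries(archive_bytes):
    """Return (file name, uncompressed size) for every entry in a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        # ZipInfo.filename and ZipInfo.file_size correspond to the
        # "filename" and "uncompressed size" header fields.
        return [(info.filename, info.file_size) for info in zf.infolist()]


def make_demo_archive():
    """Build a small in-memory archive so the sketch is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("video/episode1.bin", b"a" * 5000)
        zf.writestr("player.exe", b"b" * 100)
    return buf.getvalue()


entries = list_entries(make_demo_archive())
for name, size in entries:
    print(name, size)
```

Note that the directory information is embodied in the entry name ("video/episode1.bin"), as the description of the file list states.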
In the embodiments of the present invention, the attribute information may be file size information. In this case, the preset condition in step 102 may be:
Condition S1: the size of the file or folder is the largest in the file package.
Or,
Condition S2: the ratio of the size of the file or folder to the size of the file package is greater than or equal to a threshold.
Referring to Figure 2, a flow chart of method embodiment two, which applies condition S1 to determine the file package resource identification information, is shown. It may comprise the following steps:
Step 201: obtain the size information of each file and/or folder in the root directory layer of a file package;
Step 202: if the size of a certain file or folder is the largest among all files and folders in the root directory layer of the file package, extract the name information of that file or folder as the resource identification information of the file package.
In this embodiment, based on the file size information recorded in the file list, the sizes of the files and/or folders in the root directory layer of the file package are computed and compared with one another. The largest file or folder is taken to dominate the content of the file package, so its name information can be determined to be the resource identification information of the file package.
For example, the file list of a certain file package comprises: rule political affairs Shinjin-O\first episode.rmvb, rule political affairs Shinjin-O\second episode.rmvb, TV play photos\actor pictures\Xin Wan Jun.jpg, TV play photos\actor pictures\lily.jpg, and player.exe. The files and folders in the root directory layer are therefore: rule political affairs Shinjin-O, TV play photos, and player.exe. Suppose their sizes are, respectively: rule political affairs Shinjin-O, 3 GB; TV play photos, 2 MB; player.exe, 23 MB. Because "rule political affairs Shinjin-O" is the largest of all the root-level files and folders, "rule political affairs Shinjin-O" can be chosen as the resource identification information of the file package.
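Condition S1 might be sketched as follows, as an illustration only. The sketch assumes the file list is available as (path, uncompressed size) pairs, treats the first path component as the root-level file or folder, and uses hypothetical entry names and sizes that mirror the proportions of the example above.

```python
from collections import defaultdict

MB = 1024 ** 2


def root_level_sizes(entries):
    """Sum entry sizes per root-level file or folder (first path component)."""
    sizes = defaultdict(int)
    for path, size in entries:
        root = path.replace("\\", "/").strip("/").split("/")[0]
        sizes[root] += size
    return dict(sizes)


def largest_root_entry(entries):
    """Condition S1: return the name of the largest root-level file or folder."""
    sizes = root_level_sizes(entries)
    return max(sizes, key=sizes.get)


# Hypothetical file list: a 3 GB drama folder, 2 MB of photos, a 23 MB player.
demo_entries = [
    ("drama/episode1.rmvb", 1536 * MB),
    ("drama/episode2.rmvb", 1536 * MB),
    ("photos/actors/a.jpg", 1 * MB),
    ("photos/actors/lily.jpg", 1 * MB),
    ("player.exe", 23 * MB),
]
identification = largest_root_entry(demo_entries)
```

Folder sizes are obtained by summing the sizes of everything beneath the folder, which is one reasonable reading of "the size of the folder"; other aggregation rules are possible.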
Referring to Figure 3, a flow chart of method embodiment three, which applies condition S2 to determine the file package resource identification information, is shown. It may comprise the following steps:
Step 301: obtain the size information of each file and/or folder in the root directory layer of a file package;
Step 302: if the ratio of the size of a certain file or folder to the sum of the sizes of all files and folders in the root directory layer of the file package is greater than or equal to a threshold, extract the name information of that file or folder as the resource identification information of the file package.
In this embodiment, based on the file size information recorded in the file list, the sizes of the files and/or folders in the root directory layer of the file package are computed. If the size of a certain file or folder reaches a certain proportion of the total size of all files in the file package, that file or folder is taken to dominate the file package, and its name information is therefore determined to be the resource identification information of the file package.
For example, the file list of a certain file package comprises: rule political affairs Shinjin-O\first episode.rmvb, rule political affairs Shinjin-O\second episode.rmvb, TV play photos\actor pictures\Xin Wan Jun.jpg, TV play photos\actor pictures\lily.jpg, and player.exe. The folders in the root directory layer are rule political affairs Shinjin-O and TV play photos, and the file in the root directory layer is player.exe. In this case, the resource identification information needs to be determined from among rule political affairs Shinjin-O, TV play photos, and player.exe, which may be done as follows:
A. Obtain the size w(i) of each file or folder and calculate the total size, total. For example, this can be expressed by the following formula:
total = w(1) + w(2) + w(3)
B. Calculate the ratio f(i) of the size w(i) of each file or folder to the total size, total. For example, this can be expressed by the following formula:
f(i) = w(i) / total
If f(i) is greater than or equal to a preset ratio, such as 80%, the name information of that file or folder is used as the resource identification information. In the example above, if the ratio of "rule political affairs Shinjin-O" reaches 80% or more, "rule political affairs Shinjin-O" is used as the resource identification information of the file package.
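Steps A and B can be illustrated with a short sketch. It assumes the root-level sizes w(i) are already known and supplied as a dictionary; the entry names, the sizes, and the function name dominant_entries are hypothetical, and the 80% threshold is the example value from the text.

```python
MB = 1024 ** 2


def dominant_entries(sizes, threshold=0.8):
    """Condition S2: names whose share f(i) = w(i) / total reaches the threshold."""
    total = sum(sizes.values())  # step A: total = w(1) + w(2) + ...
    if total == 0:
        return []
    # step B: compare each ratio f(i) with the preset threshold
    return [name for name, w in sizes.items() if w / total >= threshold]


# Hypothetical root-level sizes matching the example (3 GB, 2 MB, 23 MB).
demo_sizes = {"drama": 3072 * MB, "photos": 2 * MB, "player.exe": 23 * MB}
ratios = {name: w / sum(demo_sizes.values()) for name, w in demo_sizes.items()}
selected = dominant_entries(demo_sizes)
```

With any threshold above 50%, at most one entry can qualify; with lower thresholds the function may return several names, which corresponds to the multiple-identification case.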
It should be noted that, in the embodiments of the present invention, files of different formats compress differently, so the size of each file after compression is not reduced in proportion to its original size. Therefore, if the file package is a compressed file package, the file size information in the above embodiments is preferably the size of the file or folder before compression rather than its size after compression. In addition, when more than one name qualifies as file package resource identification information, any one of them may be chosen as the resource identification information of the file package, or another rule may be applied to choose one of them, or all of the names may be used directly as the resource identification information of the file package; all of these are feasible, and the present invention does not limit this.
In the embodiments of the present invention, the attribute information may also be name information. In this case, the preset condition in step 102 may be:
Condition S3: the name information of the file or folder does not match a preset invalid keyword;
And/or,
Condition S4: the name information of the file or folder matches a preset valid keyword.
In practice, the name of some file packages also reflects the main content of the file package resource truthfully and effectively. Preferably, therefore, the candidate objects may also include the file package itself, in which case the preset condition may be:
Condition S3': the name information of the file package, or the name information of a file or folder in the file package, does not match a preset invalid keyword;
And/or,
Condition S4': the name information of the file package, or the name information of a file or folder in the file package, matches a preset valid keyword.
Referring to Figure 4, a flow chart of method embodiment four, which applies conditions S3' and S4' to determine the file package resource identification information, is shown. It may comprise the following steps:
Step 401: obtain the name information of a file package and the name information of each file or folder it contains;
Step 402: match these names against the preset invalid keywords, and filter out any name that matches.
This step filters out invalid name information. In the present invention, invalid name information can be understood as name information that cannot correctly identify the main content of the file package, for example "123.rar", "new folder.zip", "20071021.rar", or "test.zip". Invalid keywords can be set by collecting data on such invalid names, for example purely numeric strings, "new folder", and so on. When the name information of the file package, or of a file or folder it contains, fully or partially matches an invalid keyword, that name information is removed. If, after this removal, only one name or several identical names remain, that name can be used directly as the resource identification information of the current file package. Otherwise, the following steps can be carried out.
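The filtering in step 402 might look like the following sketch. The keyword list, the treatment of purely numeric names, and the choice of substring matching for "partial match" are all assumptions made for illustration; the function names is_invalid and filter_names are hypothetical.

```python
import os

# Hypothetical invalid keywords compiled from commonly seen meaningless names.
INVALID_KEYWORDS = ("new folder", "test")


def is_invalid(name):
    """Return True if the name cannot identify the package content."""
    stem = os.path.splitext(name)[0].strip().lower()
    if stem.isdigit():  # purely numeric names such as "123" or "20071021"
        return True
    # Full or partial match against any preset invalid keyword.
    return any(keyword in stem for keyword in INVALID_KEYWORDS)


def filter_names(names):
    """Step 402: drop every name that matches an invalid keyword."""
    return [name for name in names if not is_invalid(name)]


remaining = filter_names([
    "123.rar",
    "new folder.zip",
    "20071021.rar",
    "test.zip",
    "Foundation of Software Engineering",
])
```

Plain substring matching is deliberately crude: a keyword such as "test" would also reject a legitimate name that merely contains those letters, so a practical keyword list would likely need word-boundary or whole-name rules.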
Step 403: match the valid names remaining after the above filtering against the preset valid keywords, and select the valid names that match.
For example, the valid keywords may be set to "software", "official version", "beta version", and so on (such valid keywords usually indicate that the file package has been named in a standardized way). If one valid name, or several identical valid names, fully or partially matches a valid keyword, that valid name can be determined to be the resource identification information of the current file package. If several different valid names still remain, the following step can be carried out to determine the best file package resource identification.
Preferably, the valid keywords may include software version number keywords. This setting is aimed mainly at the resource identification information of software. In practice, most software carries its version number in its name, for example "PowerWord 5.7.6.426", "fleeing hare V3.0.108", or "Chinese idiom Lianliankan V2008 Build 1890". For names of this kind, the software version number can be extracted and then matched against the version number keywords. More preferably, the software version number may be extracted as follows: judge whether the name contains a continuous character string separated by periods ".", or a string that begins with 'V' or 'v' followed by a string of digits; if so, that string is the software version number.
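The version number extraction rule can be illustrated with a regular expression. This is a sketch of one possible reading of the rule: it assumes the period-separated string consists of digits, and it requires that the version not be glued to a preceding letter or digit; the names VERSION_RE and extract_version are hypothetical.

```python
import re

# Either 'V'/'v' followed by digits (optionally with dotted parts),
# or at least two groups of digits separated by periods.
VERSION_RE = re.compile(
    r"(?<![A-Za-z0-9])(?:[Vv]\d+(?:\.\d+)*|\d+(?:\.\d+)+)"
)


def extract_version(name):
    """Return the first version-like substring of a name, or None."""
    match = VERSION_RE.search(name)
    return match.group(0) if match else None


v1 = extract_version("PowerWord 5.7.6.426")
v2 = extract_version("fleeing hare V3.0.108")
v3 = extract_version("practical tool kit V3.7.exe")
v4 = extract_version("20071021.rar")
```

A purely numeric name such as "20071021.rar" yields no version under this pattern, because a bare run of digits without a period-separated continuation or a leading 'V' is not treated as a version number.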
To further ensure that the determined file package resource identification is as representative as possible of the content of the file package, a preset condition S5 may be added: weight the name information that matches a preset valid keyword, and extract the name information with the highest weight as the resource identification information of the current file package. In this case, this embodiment may further comprise the following step:
Step 404: weight the valid names that match a valid keyword, and extract the valid name with the highest weight as the resource identification information of the file package.
This embodiment is further illustrated below with a concrete example.
Suppose a file package named "test.zip" contains a file named "Foundation of Software Engineering", the preset invalid keyword is "test", and the preset valid keyword is "software". When the file package resource identification information is determined, one possible processing is as follows: because the name of the file package matches the invalid keyword "test", the name "test.zip" is filtered out, and the remaining name, "Foundation of Software Engineering", is used directly as the resource identification information of the file package. Another possible processing is as follows: the name "Foundation of Software Engineering" is further matched against the valid keyword "software"; the two are found to match, so the name "Foundation of Software Engineering" is weighted. After this processing, the weight of "Foundation of Software Engineering" is clearly higher than that of "test.zip", so the resource identification information of the file package is determined to be "Foundation of Software Engineering".
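The weighting in step 404 might be sketched as follows. The specification does not fix a weighting scheme, so the per-keyword weights, the additive scoring, and the names VALID_KEYWORDS, weight, and best_name are assumptions made purely for illustration.

```python
# Hypothetical valid keywords with hypothetical weights.
VALID_KEYWORDS = {"software": 1.0, "official version": 1.0, "beta version": 0.5}


def weight(name):
    """Sum the weights of all valid keywords that the name contains."""
    lowered = name.lower()
    return sum(w for keyword, w in VALID_KEYWORDS.items() if keyword in lowered)


def best_name(names):
    """Step 404: return the name with the highest weight."""
    return max(names, key=weight)


candidates = ["test.zip", "Foundation of Software Engineering"]
chosen = best_name(candidates)
```

Here "test.zip" matches no valid keyword and keeps a weight of zero, while "Foundation of Software Engineering" gains the weight of "software", which reproduces the second processing path of the example.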
In the embodiments of the present invention, the attribute information may also comprise both file size information and name information. In this case, those skilled in the art can combine the above preset conditions in any way, for example combining conditions S2, S3', S4', and S5.
Referring to Figure 5, a flow chart of method embodiment five for determining the resource identification information of a file package is shown, which may comprise the following steps:
Step 501: obtain attribute information of candidate objects.
The candidate objects include the files and/or folders contained in a file package; the attribute information comprises file size information and name information.
Step 502: if the attribute information of one of the candidate objects satisfies the preset conditions, determine that the name information of the candidate object satisfying the preset conditions is the resource identification information of the file package.
This step may comprise the following sub-steps:
Sub-step 5021: select the candidate files and/or folders whose size, as a proportion of the sum of the sizes of all files in the file package, is greater than or equal to a threshold;
Sub-step 5022: match the name information of the file package and the name information of the candidate files or folders against the preset invalid keywords, and filter out the invalid names that match;
Sub-step 5023: match the names remaining after filtering against the preset valid keywords, and weight the valid names that match;
Sub-step 5024: extract the name information with the highest weight as the resource identification information of the file package.
This embodiment is further illustrated below with a concrete example.
Suppose the file package "17173 mht37.rar" contains three files: "software interface.JPG", "dreamlike Journey to the West practical tool kit V3.7.exe", and "update notes.txt". The total size of these files is calculated as total = 5733902, and the ratio f(i) of each file is then: f(1) = 1.2%, f(2) = 98.6%, f(3) = 0.2%. If the preset ratio threshold is 80%, then because f(2) > 80%, "dreamlike Journey to the West practical tool kit V3.7" can be used as the candidate resource identification information. Because this name also matches the valid keyword "tool kit" and carries the software version number "V3.7", it is further weighted. In this case, the candidate resource identification information "dreamlike Journey to the West practical tool kit V3.7" is determined to be the resource identification information of the current file package.
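Sub-steps 5021 to 5024 can be combined in a single sketch. Everything beyond the four sub-steps themselves is an assumption: the keyword lists, the one-point-per-match scoring, the extension-stripping rule, and the individual file sizes (chosen so that they sum to 5733902 and give roughly the 1.2%, 98.6%, and 0.2% shares of the example). The function names stem and identify are hypothetical.

```python
import re

INVALID_KEYWORDS = ("new folder", "test")   # hypothetical
VALID_KEYWORDS = ("tool kit", "software")   # hypothetical
VERSION_RE = re.compile(r"(?<![A-Za-z0-9])[Vv]\d+(?:\.\d+)*")


def stem(name):
    """Drop a trailing letters-only extension, so that "V3.7" survives."""
    return re.sub(r"\.[A-Za-z]{1,5}$", "", name).strip()


def identify(package_name, sizes, threshold=0.8):
    total = sum(sizes.values())
    # Sub-step 5021: keep entries whose size share reaches the threshold.
    candidates = [n for n, w in sizes.items() if total and w / total >= threshold]
    # Sub-step 5022: filter the package name and candidates by invalid keywords.
    names = [stem(n) for n in [package_name] + candidates]
    names = [
        n for n in names
        if not n.isdigit() and not any(k in n.lower() for k in INVALID_KEYWORDS)
    ]
    if not names:
        return None

    # Sub-step 5023: weight names by valid keywords and version numbers.
    def weight(n):
        score = sum(1 for k in VALID_KEYWORDS if k in n.lower())
        return score + (1 if VERSION_RE.search(n) else 0)

    # Sub-step 5024: the highest-weighted name is the identification.
    return max(names, key=weight)


demo_sizes = {
    "software interface.JPG": 68807,
    "dreamlike Journey to the West practical tool kit V3.7.exe": 5653627,
    "update notes.txt": 11468,
}
result = identify("17173 mht37.rar", demo_sizes)
```

The package name "17173 mht37" survives the invalid-keyword filter but earns no weight, so the dominant file's name wins on the strength of its keyword match and version number.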
Preferably, in the embodiments of the present invention, the determined file package resource identification information may also be used to replace the original file name of the file package; in the example above, the original package name "17173 mht37" is replaced with "dreamlike Journey to the West practical tool kit V3.7". It may also be stored in a database to meet needs such as searching for and downloading file package resources; the present invention does not limit this.
Those skilled in the art will readily appreciate that any combination of the embodiments of the invention is also an embodiment of the present invention; for reasons of space, this specification does not describe each of them in detail.
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Referring to Figure 6, a structural block diagram of an embodiment of a device for determining file package resource identification information according to the present invention is shown, which may comprise the following units:
File information acquiring unit 601, configured to acquire attribute information of candidate objects, the candidate objects comprising the files and/or folders contained in a file package; preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package;
Judging unit 602, configured to judge whether the attribute information of the candidate objects satisfies a preset condition;
Determining unit 603, configured to determine, when the attribute information of one of the candidate objects satisfies the preset condition, that the name information of the candidate object satisfying the preset condition is the resource identification information of the file package.
In the present embodiment, preferably, the attribute information may be file size information; in this case, the preset condition may be:
The size of the file or folder is the largest in the file package; or
The size of the file or folder, as a proportion of the size of the file package, is greater than or equal to a threshold.
In practice, the file package may be a compressed file package; to ensure that the file package resource identification information is determined accurately, the file size information may be the size of the file or folder before compression.
In this case, the process of determining file package resource identification information using the first preferred embodiment shown in Figure 6 may comprise the following steps:
Step B1: the file information acquiring unit acquires the size information of the files and/or folders contained in a file package;
As another embodiment, this step may also be: the file information acquiring unit acquires the size information of each file and/or folder in the root directory layer of a file package;
Step B2: the judging unit judges whether the size of a given file or folder is the largest among all files and folders in the root directory layer of the file package; or judges whether the size of a given file or folder, as a proportion of the total size of all files and folders in the root directory layer of the file package, is greater than or equal to a threshold; if so, the determining unit is triggered to perform step B3;
Step B3: the determining unit determines that the name information of the file or folder satisfying the above condition is the resource identification information of the file package.
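Steps B1 to B3 might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: it uses a ZIP archive as a stand-in for the compressed file package (the Python standard library does not read RAR), it treats the first path component of each archive member as its root-layer file or folder, and the function names are hypothetical. `ZipInfo.file_size` is the uncompressed size, which corresponds to the size before compression mentioned above.

```python
import io
import zipfile
from collections import defaultdict


def root_entry_sizes(zip_bytes):
    """Step B1: sum uncompressed sizes per root-layer file or folder."""
    sizes = defaultdict(int)
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # The first path component names the root-layer file or folder.
            root = info.filename.split("/", 1)[0]
            sizes[root] += info.file_size  # size before compression
    return dict(sizes)


def pick_by_size(sizes, threshold=None):
    """Steps B2 and B3: return the largest entry, or None if a threshold
    is given and the largest entry's share of the total falls below it."""
    total = sum(sizes.values())
    if not sizes or total == 0:
        return None
    largest = max(sizes, key=sizes.get)
    if threshold is None:
        return largest
    return largest if sizes[largest] / total >= threshold else None
```

Only the largest entry needs checking against the threshold when the threshold exceeds 50%, since at most one entry can then satisfy it; for lower thresholds this sketch still returns just the largest qualifying entry.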
As another embodiment, the attribute information may be name information; in this case, the preset condition may be:
The name information of the file or folder does not match a preset invalid keyword;
and/or
The name information of the file or folder matches a preset valid keyword.
Preferably, the preset condition may further comprise:
Weighting the name information that matches a preset valid keyword, and extracting the name information with the highest weight.
In this case, the process of determining file package resource identification information using the second preferred embodiment shown in Figure 6 may comprise the following steps:
Step C1: the file information acquiring unit acquires the name information of the files and/or folders contained in a file package;
As another embodiment, this step may also be: the file information acquiring unit acquires the name information of each file and/or folder in the root directory layer of a file package;
Step C2: the judging unit matches these names against preset invalid keywords; after filtering out the names that match an invalid keyword, it further matches the remaining valid names against preset valid keywords and weights the valid names that match;
Step C3: the valid name information with the highest weight is extracted as the resource identification information of the file package.
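The filtering and weighting in steps C1 to C3 might look like the sketch below. Several details are assumptions, because the text does not fix them: matching is taken to be a case-insensitive substring test, each valid keyword carries a numeric weight, and the weights of all matched keywords are summed. The keyword lists and the function name are hypothetical.

```python
def pick_by_name(names, invalid_keywords, valid_keywords):
    """Filter out names matching an invalid keyword, weight the rest by the
    valid keywords they contain, and return the highest-weighted name.

    valid_keywords maps each keyword to its (assumed) numeric weight.
    Returns None when every name is filtered out.
    """
    remaining = [
        name for name in names
        if not any(key.lower() in name.lower() for key in invalid_keywords)
    ]
    if not remaining:
        return None

    def weight(name):
        lowered = name.lower()
        return sum(w for key, w in valid_keywords.items() if key.lower() in lowered)

    # max() returns the first name among those tied for the highest weight.
    return max(remaining, key=weight)
```

The same function can serve the case where the file package's own name is added to the candidate list, since it makes no distinction between package names and member names.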
As another embodiment, the attribute information may be name information, and the candidate objects may further comprise the file package itself; in this case, the preset condition may be:
The name information of the file package, or of a file or folder in the file package, does not match a preset invalid keyword; and/or
The name information of the file package, or of a file or folder in the file package, matches a preset valid keyword.
Preferably, the preset condition may further comprise:
Weighting the name information that matches a preset valid keyword, and extracting the name information with the highest weight.
In this case, the process of determining file package resource identification information using the third preferred embodiment shown in Figure 6 may comprise the following steps:
Step D1: the file information acquiring unit acquires the name information of a file package and the name information of the files and/or folders contained in that file package;
As another embodiment, this step may also be: the file information acquiring unit acquires the name information of a file package and the name information of each file and/or folder in the root directory layer of that file package;
Step D2: the judging unit matches these names against preset invalid keywords; after filtering out the names that match an invalid keyword, it further matches the remaining valid names against preset valid keywords and weights the valid names that match;
Step D3: the valid name information with the highest weight is extracted as the resource identification information of the file package.
In the above device embodiment, the device may further comprise a renaming unit, configured to replace the original file name of the file package with the resource identification information.
Those skilled in the art will readily appreciate that any combined use of the first and second embodiments above, or of the first and third embodiments, is feasible; any combination of the foregoing embodiments is therefore also an embodiment of the present invention, although for reasons of space this specification does not describe each in detail.
Since the device embodiment for determining file package resource identification information corresponds substantially to the foregoing method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
Referring to Figure 7, a flowchart of an embodiment of a method for generating a file package resource database according to the present invention is shown, which may comprise the following steps:
Step 701: acquire the URL of a file package in a web page and its content signature ID;
Wherein the content signature ID may be obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files;
Step 702: acquire the resource identification information of the file package;
Wherein the resource identification information may be the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects may comprise the files and/or folders contained in the file package.
Preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Step 703: record the content signature ID together with the corresponding file package URL and file package resource identification information to form a database.
In practice, the file package resources in a page and their URLs may be captured by a web crawler (spider). As is well known, a web crawler may work as follows: based on the idea of breadth-first search (BFS), it starts from the URLs (Uniform Resource Locators) of one or several initial pages and obtains the URLs on those pages; while crawling web pages, it continually extracts new URLs from the current page and puts them into a queue until a stop condition of the system is satisfied. Alternatively, the crawler may act as a program that downloads web pages automatically: according to a web page analysis algorithm, it filters out links irrelevant to the topic, keeps the useful links and puts them into a queue of URLs waiting to be crawled; it then selects the URL of the next page to crawl from the queue according to a search strategy, and repeats the above process until a certain condition of the system is reached.
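The breadth-first crawl described above can be sketched without any network access. In this illustration the `get_links` callback stands in for fetching a page and extracting its links, and `max_pages` stands in for the system's stop condition; both names, and the function name itself, are assumptions made for the example.

```python
from collections import deque


def bfs_crawl(seeds, get_links, max_pages=100):
    """Visit URLs breadth-first from the seed URLs.

    get_links(url) returns the URLs found on that page (hypothetical callback).
    Crawling stops when the queue is empty or max_pages pages were visited.
    """
    queue = deque(dict.fromkeys(seeds))  # de-duplicate seeds, keep their order
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # enqueue each newly discovered URL once
                seen.add(link)
                queue.append(link)
    return visited
```

A topic-focused variant, as in the second half of the paragraph, could be obtained by having `get_links` return only the links that pass a relevance filter.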
To ensure the uniqueness of the captured file package resources and avoid repeatedly processing file package resources with identical content, a content signature ID may be calculated to uniquely identify the corresponding file package. The content signature ID may be obtained by applying a predetermined algorithm to the content data of a binary file; the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files, or an algorithm whose results have an extremely low repetition rate. For example, a hash operation is performed on the content data of each binary file to obtain a hash value of the file content, and this hash value can serve as the content signature ID to uniquely identify the corresponding file package.
Specifically, one method of calculating the content signature ID is as follows: take 32 KB of data from each of the beginning, middle and end of the file package resource (content data from other parts of the file package may of course be chosen; this is only an example); apply a hash algorithm (for example, a message-digest algorithm such as MD5 or MD4, or a Secure Hash Algorithm) to each of the three parts separately; concatenate the three resulting values in order and apply the same algorithm to the concatenated data once more; the value finally obtained is the content signature ID of the file package. Multiple file packages with the same content signature ID can be regarded as having identical content. When the database is generated, the content signature ID may then be used as the primary key, under which the corresponding file package resource identification information and URL information are stored.
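A minimal sketch of this signature calculation, using MD5 from Python's `hashlib`, is given below. Some details are assumptions that the text leaves open: the middle block is taken to be centred in the data, the three raw digests (rather than their hexadecimal forms) are concatenated, and inputs shorter than 32 KB simply yield overlapping blocks. The function name is illustrative.

```python
import hashlib

CHUNK = 32 * 1024  # 32 KB sampled from each of the three positions


def content_signature(data: bytes) -> str:
    """Hash the first, middle and last 32 KB, then hash the joined digests."""
    size = len(data)
    mid_start = max(0, (size - CHUNK) // 2)  # assumed: middle block is centred
    parts = (
        data[:CHUNK],
        data[mid_start:mid_start + CHUNK],
        data[-CHUNK:],
    )
    digests = [hashlib.md5(part).digest() for part in parts]
    # Concatenate the three digests in order and hash once more.
    return hashlib.md5(b"".join(digests)).hexdigest()
```

Because only three blocks are sampled, bytes outside those blocks do not affect the result; this keeps the calculation cheap for large packages at the cost of not detecting differences confined to the unsampled regions.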
After the file package resource database has been generated, it can further be used to provide services such as search or download.
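Recording the three fields with the content signature ID as primary key might be sketched with Python's built-in `sqlite3` module as follows. The table name, column names and function names are hypothetical, and SQLite is used only as a convenient example store. A second package with the same signature is ignored, which reflects the stated aim of not processing identical content twice.

```python
import sqlite3


def create_database(conn):
    """Create the (hypothetical) packages table keyed by content signature ID."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        " sig_id TEXT PRIMARY KEY,"
        " url TEXT NOT NULL,"
        " resource_id TEXT NOT NULL)"
    )


def record_package(conn, sig_id, url, resource_id):
    """Store one record; return False if the signature is already present."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO packages (sig_id, url, resource_id)"
        " VALUES (?, ?, ?)",
        (sig_id, url, resource_id),
    )
    return cursor.rowcount == 1
```

A design that keeps several URLs per signature would need a separate URL table; this sketch keeps only the first URL seen for each signature.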
Referring to Figure 8, a structural block diagram of an embodiment of a device for generating a file package resource database according to the present invention is shown, which may comprise the following units:
Resource URL crawling unit 801, configured to acquire the URL of a file package in a web page;
Content signature ID computing unit 802, configured to obtain the content signature ID of the file package by applying a predetermined algorithm to the content data of the file package;
Preferably, the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files.
Resource identification information acquiring unit 803, configured to acquire the identification information of the file package resource;
Wherein the identification information may be the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects may comprise the files and/or folders contained in the file package; preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Recording unit 804, configured to record the content signature ID together with the corresponding file package resource URL and file package resource identification information to form a database.
The process of generating a file package resource database using the preferred embodiment shown in Figure 8 may comprise:
Step E1: the resource URL crawling unit acquires the URL of a file package in a web page, and the content signature ID computing unit obtains the content signature ID of the file package by applying a predetermined algorithm to the content data of the file package;
Wherein the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files.
Step E2: the resource identification information acquiring unit acquires the resource identification information of the file package;
Wherein the resource identification information may be the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects may comprise the files and/or folders contained in the file package; preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Step E3: the recording unit records the content signature ID together with the corresponding file package resource URL and file package resource identification information to form a database.
Referring to Figure 9, a flowchart of an embodiment of a method for searching for file package resources according to the present invention is shown, which may comprise the following steps:
Step 901: initialize a database;
Preferably, the database may comprise content signature IDs and the corresponding file package URLs and file package resource identification information. The file package resource URL is obtained by capturing file package resources in web pages; the content signature ID is obtained by applying a predetermined algorithm to the content data of the file package, the predetermined algorithm being an algorithm that yields different results for the content data of different binary files; the file package resource identification information is the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects comprise the files and/or folders contained in the file package. Preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Step 902: match the corresponding file package resource identification information in the database according to the search keyword entered by the user;
Step 903: look up the corresponding file package URL via the matched file package resource identification information; or look up the corresponding content signature ID, and obtain the corresponding file package URL according to the content signature ID.
Preferably, the database may be provided at the server side and stores the correspondence between content signature IDs, file package resource identification information and URL information.
The process of searching data with this database may be: receive the search keyword submitted by the user and match it against the file package resource identification information stored in the database; if a match is found, find the corresponding content signature ID via that file package resource identification information, and then extract the corresponding URL according to the content signature ID.
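The two-step lookup described here (keyword to resource identification information to content signature ID, then content signature ID to URL) can be sketched with two in-memory mappings. The mapping names, the function name and the case-insensitive substring matching rule are assumptions made for illustration; a real server would query its database instead.

```python
def search_packages(keyword, sig_by_resource_id, url_by_sig):
    """Return (resource_id, sig_id, url) for every identification that
    contains the keyword, resolving the URL through the signature ID."""
    needle = keyword.lower()
    results = []
    for resource_id, sig_id in sig_by_resource_id.items():
        if needle in resource_id.lower():
            url = url_by_sig.get(sig_id)
            if url is not None:  # skip signatures with no recorded URL
                results.append((resource_id, sig_id, url))
    return results
```

Routing the lookup through the signature ID means that two identifications pointing at the same content resolve to the same URL record.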
Preferably, the present embodiment may further comprise step 904: download the data of the file resource from the URL.
After downloading a file package at the client, the user may also perform a renaming operation on the download. Specifically, the client calculates the content signature ID of the file package and submits it to the server; the server looks up the file package resource identification information corresponding to this content signature ID in the database and, if it exists, returns the identification information to the client. After receiving the information returned by the server, the client prompts the user whether to rename; if the user confirms, the client automatically changes the name information of the file package resource to the resource identification information returned by the server.
Referring to Figure 10, a structural block diagram of an embodiment of a device for searching for file package resources according to the present invention is shown, which may comprise the following units:
Database 1001, the database comprising content signature IDs and the corresponding file package resource URLs and file package resource identification information;
Wherein the file package resource URL may be obtained by capturing file package resources in web pages; the content signature ID may be obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files; the file package resource identification information may be the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects may comprise the files and/or folders contained in the file package; preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Matching unit 1002, configured to match the corresponding file package resource identification information in the database according to the search keyword entered by the user;
Lookup unit 1003, configured to look up the corresponding file package URL via the matched file package resource identification information; or to look up the corresponding content signature ID and obtain the corresponding file package URL according to the content signature ID.
Preferably, the present embodiment may further comprise:
Download unit, configured to download the data of the file resource from the URL.
The process of searching for file package resources using the preferred embodiment shown in Figure 10 may comprise:
Step F1: initialize a database, the database comprising content signature IDs and the corresponding file package resource URLs and file package resource identification information;
Wherein the file package resource URL may be obtained by capturing file package resources in web pages; the content signature ID may be obtained by applying a predetermined algorithm to the content data of the file package, and the predetermined algorithm may be an algorithm that yields different results for the content data of different binary files; the file package resource identification information may be the name information of a candidate object whose attribute information satisfies a preset condition, and the candidate objects may comprise the files and/or folders contained in the file package; preferably, the files and/or folders contained in the file package may be the files and/or folders contained in the root directory layer of the file package.
Step F2: the matching unit matches the corresponding file package resource identification information in the database according to the search keyword entered by the user;
Step F3: the lookup unit looks up the corresponding file package URL via the matched file package resource identification information; or looks up the corresponding content signature ID and obtains the corresponding file package URL according to the content signature ID.
It should be noted that, in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related description of the other embodiments. In addition, since each device embodiment corresponds substantially to its method embodiment, its description is relatively brief; for relevant details, refer to the corresponding description of the method embodiment.
In summary, the embodiments of the invention select, according to the sizes of the files and folders in the root directory layer of a file package, the name information of the file or folder that is the largest, or whose size meets a certain threshold, as the resource identification information of the current file package; that is, the name information of the file or folder carrying the greatest weight in the whole file package resource is chosen as the identification information of the file package, so that a more representative file package resource identification can be obtained. The embodiments of the invention can further select, according to preset rules, more accurate name information from among the name information of the largest file or folder, or of the file or folder whose size meets a certain threshold, and the name information of the current file package itself, as the identification information of the file package, thereby further improving the accuracy with which file package resource identification information is determined. Moreover, the embodiments of the invention can generate a database by recording the file package resource identification information; in practice, this database can be used for searching for and downloading file package resources. Because the file package resource identification information is more accurate, the accuracy of search results can be improved; and because the identification information is recorded against the content signature ID of the file package, which avoids repeated processing of file packages with identical content, search efficiency can be effectively improved.
The method and device for determining file package resource identification information, the method and device for generating a file package resource database, and the method and device for searching for file package resources provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, for a person of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.