CN111475464B - Method for automatically finding and mining fingerprints of Web component - Google Patents

Info

Publication number
CN111475464B
Authority
CN
China
Prior art keywords
component
website
file
fingerprint
library
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010197426.6A
Other languages
Chinese (zh)
Other versions
CN111475464A (en)
Inventor
陈龙
周双飞
夏书银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN202010197426.6A
Publication of CN111475464A
Application granted
Publication of CN111475464B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Storage Device Security (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for automatically discovering and mining fingerprints of Web components, belonging to the field of computer networks. The method comprises the following steps: collecting website webpage data from different domain names and storing it in a website webpage database; calculating the digital digest (Hash value) of each static file (JS file, CSS file, picture) that is unique to the source code of each open source component; extracting from the website static file feature library the digital digests whose Count > N (N is a natural number greater than 2) and matching them in turn against the digital digests in the component source code file feature database; based on the website-component association library, extracting the special file path features and keyword features of the component from the component source code file feature library, and matching each feature against a large number of websites that contain the component; and selecting the features with the most hits in the component fingerprint library to be selected and adding them to the component fingerprint library. The invention thereby enables automatic discovery and mining of Web component fingerprints.

Description

Method for automatically finding and mining fingerprints of Web component
Technical Field
The invention belongs to the field of computer networks, and relates to a method for automatically finding and mining fingerprints of Web components.
Background
A website is built from components: servers, databases, Web containers, plug-ins, middleware and the like are all website components. Identifying which components a website is built from is generally done by component fingerprint matching. A component fingerprint is a piece of information that uniquely identifies a component; it may be the Hash value of a static file (JS file, CSS file, picture, etc.), a special file path, a key field, or anything else that is unique to the component. A successful fingerprint match indicates that the website is using the component.
Of these kinds of fingerprint, the Hash value of a static file identifies a component most accurately. The present invention builds on this property.
The richness and accuracy of the component fingerprint library are the main constraints on component fingerprint identification. Components are added rapidly and their versions change, so fingerprints must continually be added and updated, which makes collecting component fingerprints time-consuming and labor-intensive.
Defects and deficiencies of the prior art:
Existing component fingerprint discovery relies mainly on manual labeling, which is why every Web component fingerprint identification platform or open source tool offers a function or channel for submitting component fingerprints. The drawbacks are high cost and low efficiency.
Disclosure of Invention
In view of the above, the invention aims to provide a method for automatically discovering and mining fingerprints of Web components. The main problem it solves is discovering component fingerprints automatically, efficiently and at low cost, ending the reliance on manual labeling so that automatic discovery is the primary means and manual labeling only an auxiliary one.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A method for automatically discovering and mining fingerprints of Web components, the method comprising the following steps:
1) Establishing a webpage database, a website static file feature library (storing digital digests, i.e. Hash values), a component source code file feature library, a website_component association library, a component fingerprint library, and a component fingerprint library to be selected;
2) Collecting website webpage data under different domain names and storing it in the webpage database;
3) Processing the website data, comprising the following steps:
3.1 Calculating the Hash values of the website's static files (JavaScript (JS) files, Cascading Style Sheets (CSS) files and pictures), together with its special file path features and keyword features; one website has several static file digital digests (Hash values), special file paths and keyword features;
3.2 Storing each calculated Hash value in the website static file feature library; if the Hash value already exists in the library, increasing its Count by 1;
4) Calculating the Hash values of the static files (JS files, CSS files and pictures) unique to the source code of each open source component, together with the special file path features and keyword features, and storing the results in the component source code file feature database; one component has several static file Hash values;
5) Comparing and matching the Hash values of the website static files against the Hash values of the component source code files, comprising the following steps:
5.1 Extracting from the website static file feature library one Hash value record with Count > N, where N is any natural number greater than 2;
5.2 Comparing the Hash value extracted in 5.1 with the Hash values in the component source code file feature database in turn; if two Hash values are the same, the match succeeds;
5.3 If the match in step 5.2 succeeds, writing the matched Hash value into the component fingerprint library as a fingerprint of the component; at the same time, labeling the websites that contain this Hash value in the website static file feature library with the component identifier, thereby associating the component with those websites, and writing the association into the website_component association library; then ending this round of matching and extracting the next Hash value with Count > N, until every Hash value with Count > N has been matched against the Hash values in the component source code file feature library; if the match fails, returning to step 5.1;
6) Extracting from the website_component association library all websites associated with a given component, and extracting the corresponding website webpage data from the webpage database;
7) For that component, extracting its special file path features and keyword features from the component source code file feature database, and matching each extracted feature in turn against the extracted website webpage data; if a match succeeds, writing the feature into the component fingerprint library to be selected, and increasing the feature's Count by 1 each time the feature is matched in the webpage data of a different website;
8) Selecting the features whose hit number Count > M (M is any natural number greater than 2) from the component fingerprint library to be selected and writing them into the component fingerprint library.
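The six stores named in step 1 can be pictured as simple in-memory structures. The following is only a minimal sketch, not the patented implementation: every class and field name is hypothetical and chosen merely to mirror the libraries listed in the steps above, and a real system would presumably use database tables instead.

```python
from dataclasses import dataclass, field


@dataclass
class SiteHashRecord:
    """One row of the website static file feature library (steps 3.1-3.2)."""
    hash_value: str
    site_urls: list[str] = field(default_factory=list)
    count: int = 0


@dataclass
class ComponentFeatures:
    """One row of the component source code file feature library (step 4)."""
    name: str
    hashes: set[str] = field(default_factory=set)
    paths: set[str] = field(default_factory=set)
    keywords: set[str] = field(default_factory=set)


@dataclass
class Stores:
    """The six stores of step 1; all names here are illustrative assumptions."""
    # url -> raw webpage data
    webpages: dict[str, str] = field(default_factory=dict)
    # Hash value -> occurrence record
    site_hashes: dict[str, SiteHashRecord] = field(default_factory=dict)
    # component name -> features taken from its source code
    component_features: dict[str, ComponentFeatures] = field(default_factory=dict)
    # url -> names of the components detected on that website
    site_components: dict[str, set[str]] = field(default_factory=dict)
    # component name -> confirmed fingerprints
    fingerprints: dict[str, set[str]] = field(default_factory=dict)
    # (component name, feature) -> hit Count in the library to be selected
    candidates: dict[tuple[str, str], int] = field(default_factory=dict)
```

Using `default_factory` gives every `Stores` instance its own empty containers, so two independent runs do not share state.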
Optionally, the website static file features include the Hash values of the website's static files, namely its JavaScript (JS) files, Cascading Style Sheets (CSS) files and pictures.
Optionally, the component source code file features include the Hash values of the static files (JS files, CSS files and pictures) unique to the component source code, as well as its unique file path features and keyword features.
Optionally, whether the Hash value of a static file is a component fingerprint is judged as follows: the Hash values of the component's static files are compared in turn with the Hash values of static files that appear multiple times across different websites; if two Hash values are the same, that Hash value is judged to be a component fingerprint.
Optionally, the component fingerprint mining method is as follows: the special file path features and keyword features in the component's source code file features are matched against a large amount of data from websites that contain the component; if a match succeeds, it is checked whether the feature already exists in the component fingerprint library to be selected; if so, the feature's Count is increased by 1, and if not, the feature is written into the component fingerprint library to be selected.
Optionally, the special file path features and keyword features are selected as follows: the special file paths and keywords that have been hit many times are selected and written into the component fingerprint library as fingerprints.
The beneficial effects of the invention are as follows: the static file Hash value fingerprints of a component are discovered by statistically counting Hash values and comparing them; once these are found, the special file paths and keywords that can serve as further fingerprints of the component can be discovered in turn. Component fingerprints are thereby discovered and mined automatically.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart for discovering new component fingerprints;
FIG. 2 is a component fingerprint mining flow.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may be practiced or carried out in other embodiments that depart from the specific details, and the details of the present description may be modified or varied from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention by way of illustration, and the following embodiments and features in the embodiments may be combined with each other without conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
The invention aims to overcome the shortcomings of traditional component fingerprint discovery and provides a method for automatically discovering and updating component fingerprints. Website data is collected from different domain names, and the number of occurrences of each static file (JS, CSS, pictures, etc.) is counted. Static file Hash values that occur more than N times are matched against the static file Hash values of the components. If a match succeeds, the Hash value is taken as a fingerprint of the component, and other fingerprints of the component are then discovered on that basis, realizing automatic discovery of component fingerprints.
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments. FIG. 1 and FIG. 2 are flowcharts of the method for automatically discovering and mining component fingerprints, which includes the following:
As shown in FIG. 1, assume that webpage data collection has started.
Step A: website data is collected and processed to obtain static file Hash values. Assume the website static file feature library currently stores the following:
Static filename   Static file Hash value   Site_url              Count
A                 Hash_A                   url1,url2             2
B                 Hash_B                   url                   1
C                 Hash_C                   url                   1
Each time the Hash value of a static file is collected again, the address of the website it was found on is added to Site_url and Count is increased by 1. For example, if the Hash value of static file A is collected again during acquisition (here on url3 and url4), the database record changes as follows:
Static filename   Static file Hash value   Site_url              Count
A                 Hash_A                   url1,url2,url3,url4   4
B                 Hash_B                   url                   1
C                 Hash_C                   url                   1
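The update rule of Step A can be sketched as follows. This is an illustration under stated assumptions rather than the patent's implementation: the text only says "Hash value", so SHA-256 is an assumed choice, the names `file_hash` and `record_static_file` are hypothetical, and counting each website at most once per file is also an assumption.

```python
import hashlib


def file_hash(content: bytes) -> str:
    """Digest of a static file's bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def record_static_file(library: dict, content: bytes, site_url: str) -> None:
    """Add one observation of a static file to the feature library."""
    key = file_hash(content)
    entry = library.setdefault(key, {"site_urls": [], "count": 0})
    # Assumption: a website contributes at most once to a file's Count.
    if site_url not in entry["site_urls"]:
        entry["site_urls"].append(site_url)
        entry["count"] += 1


library = {}
file_a = b"/* contents of static file A */"
for url in ["url1", "url2"]:
    record_static_file(library, file_a, url)
record_static_file(library, b"/* contents of static file B */", "url")

# Static file A is collected again on two further websites.
for url in ["url3", "url4"]:
    record_static_file(library, file_a, url)
```

After the last loop, the entry for file A lists url1 to url4 and has a Count of 4, matching the second table above.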
Step B: the source code files of each open source component are processed to calculate and extract the Hash values of the static files unique to the component, along with other features, mainly file path features and key field features. A storage example of the component source code file feature database is as follows:
Figure SMS_1
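One way to gather the static file Hash values and file paths of Step B is to walk a component's source tree. The sketch below makes several assumptions: SHA-256 stands in for the unspecified Hash function, the suffix list and the function name `component_static_features` are hypothetical, and deciding which files are actually unique to the component is not attempted here.

```python
import hashlib
from pathlib import Path

# Assumed set of static file types (JS, CSS and common picture formats).
STATIC_SUFFIXES = {".js", ".css", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg"}


def component_static_features(source_root: str) -> dict[str, str]:
    """Map each static file's relative path to its SHA-256 digest."""
    root = Path(source_root)
    features = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in STATIC_SUFFIXES:
            rel = path.relative_to(root).as_posix()
            features[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return features
```

The relative paths returned here could also serve as candidate file path features, while keyword features would need a separate extraction step that the source does not detail.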
Step C: when the Count of a record in the website static file feature database satisfies Count > N (N is any natural number greater than 2), the Hash value of that record is extracted. A component fingerprint is significant because it can be identified repeatedly across websites, hence the requirement N > 2. If N is 3, only static file A in the website static file feature library meets the condition, so the Hash value Hash_A of A is extracted.
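The threshold test of Step C is small enough to show directly. This is a minimal sketch using the counts from the example tables; the function name `hashes_above_threshold` is hypothetical.

```python
def hashes_above_threshold(counts: dict[str, int], n: int) -> list[str]:
    """Return the Hash values whose Count is strictly greater than n."""
    if n <= 2:
        raise ValueError("the method requires N > 2")
    return [hash_value for hash_value, count in counts.items() if count > n]


counts = {"Hash_A": 4, "Hash_B": 1, "Hash_C": 1}
selected = hashes_above_threshold(counts, n=3)
print(selected)  # ['Hash_A']
```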
Step D: the extracted Hash value is compared in turn with the static file Hash values of each component (Component 1, Component 2, ...).
If the match succeeds, the following two operations are performed:
1) Hash_A is written into the component fingerprint library as a fingerprint of the matched component. In this example Hash_A matches a static file of Component 1, so the component fingerprint library record becomes:
Component name   Hash value fingerprint   File path feature fingerprint   Key field feature fingerprint
Component 1      Hash_A                   null                            null
2) The websites containing Hash_A are extracted from the website static file feature library, labeled with the component, and written into the website_component association library. In the website static file feature library, Hash_A is associated with 4 websites, namely url1, url2, url3 and url4, so the records of the website_component association library become:
Site_url   Component list
url1       Component 1
url2       Component 1
url3       Component 1
url4       Component 1
This round of the loop then ends, and the Hash value of the next record with Count > N is selected for comparison and matching.
If the match fails, the procedure returns to Step C.
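The two writes performed on a successful match in Step D can be sketched as below. The function name `match_hash` and the extra Hash values `Hash_X` and `Hash_Y` are hypothetical, and plain dictionaries stand in for the component fingerprint library and the website_component association library.

```python
def match_hash(hash_value, site_urls, component_hashes, fingerprints, associations):
    """Compare one website-side Hash value with each component in turn."""
    for component, hashes in component_hashes.items():
        if hash_value in hashes:
            # Operation 1: record the Hash value as a fingerprint of the component.
            fingerprints.setdefault(component, set()).add(hash_value)
            # Operation 2: label every website carrying the Hash value.
            for url in site_urls:
                associations.setdefault(url, set()).add(component)
            return component
    return None  # no match: the caller moves on to the next Hash value


component_hashes = {
    "Component 1": {"Hash_A", "Hash_X"},
    "Component 2": {"Hash_Y"},
}
fingerprints, associations = {}, {}
matched = match_hash("Hash_A", ["url1", "url2", "url3", "url4"],
                     component_hashes, fingerprints, associations)
```

With the example data, `matched` is "Component 1", `fingerprints` holds Hash_A for that component, and all four URLs are associated with it.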
Step E: the Hash value fingerprints in the component fingerprint library are extracted, and the websites containing a component's Hash value fingerprint are obtained from the website_component association library. If there are many such websites, L of them are selected; otherwise all of them are selected. The component's file path features and keyword features are then matched against the webpage data of the selected websites.
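The selection and matching in Step E might look like the following sketch. The source does not say how the L websites are chosen or how a feature is matched, so random sampling and plain substring search are both assumptions, and the page contents, feature strings and function names are invented for illustration.

```python
import random


def select_sites(site_urls: list[str], limit: int, seed: int = 0) -> list[str]:
    """Pick at most `limit` websites; keep them all if there are fewer."""
    if len(site_urls) <= limit:
        return list(site_urls)
    return random.Random(seed).sample(site_urls, limit)


def features_found(page_data: str, features: list[str]) -> list[str]:
    """Return the features that occur as substrings of the page data."""
    return [feature for feature in features if feature in page_data]


# Hypothetical page data and component features.
pages = {
    "url1": '<script src="/static/widget/core.js"></script>',
    "url2": '<meta name="generator" content="ExampleWidget">',
}
features = ["/static/widget/core.js", "ExampleWidget"]
hits = {
    url: features_found(pages[url], features)
    for url in select_sites(list(pages), limit=5)
}
```

Here the path feature is found only on url1 and the keyword only on url2, so each would receive one hit.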
If a match succeeds, the feature is added to the component fingerprint library to be selected. Assume the library currently stores the following:
Component name   Feature       Type      Count
Component 1      Feature 1_1   keyword   1
Component 1      Feature 1_2   path      5
Component 2      Feature 2_1   keyword   8
Example: if Feature 1_1 of Component 1 is successfully hit once more in the website page data, the record in the database changes to:
Figure SMS_2
Figure SMS_3
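The update described in this example is an insert-or-increment on the component fingerprint library to be selected. The sketch below assumes that the updated record simply shows the Count of Feature 1_1 rising from 1 to 2, since the figures themselves are not reproduced in the text; the function name `record_hit` is hypothetical.

```python
def record_hit(candidates: dict, component: str, feature: str, kind: str) -> None:
    """Insert the feature with Count 1, or add 1 to its Count if present."""
    key = (component, feature)
    if key in candidates:
        candidates[key]["count"] += 1
    else:
        candidates[key] = {"type": kind, "count": 1}


candidates = {
    ("Component 1", "Feature 1_1"): {"type": "keyword", "count": 1},
    ("Component 1", "Feature 1_2"): {"type": "path", "count": 5},
    ("Component 2", "Feature 2_1"): {"type": "keyword", "count": 8},
}
# Feature 1_1 of Component 1 is hit once more.
record_hit(candidates, "Component 1", "Feature 1_1", "keyword")
```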
Step F: for each component, the features in the component fingerprint library to be selected that have been hit many times, i.e. those with a large Count, are selected. A threshold Y is set, which changes dynamically with the number of matched websites, and the component features with Count > Y are written into the component fingerprint library. Example: for Component 1, Feature 1_2 performs better and satisfies Count > Y; it is recorded as a key field feature fingerprint, and the database record is updated as:
Component name   Hash value fingerprint   File path feature fingerprint   Key field feature fingerprint
Component 1      Hash_A                   null                            Feature 1_2
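The text says only that the threshold Y changes with the number of matched websites, without giving a formula. The sketch below assumes, purely for illustration, that Y is a fixed fraction of that number; the `ratio` parameter, the figure of 8 matched websites and the function name `select_fingerprints` are all hypothetical.

```python
def select_fingerprints(counts: dict[str, int], matched_sites: int,
                        ratio: float = 0.5) -> list[str]:
    """Keep the features whose Count exceeds Y = ratio * matched_sites."""
    y = ratio * matched_sites  # assumed form of the dynamic threshold
    return [feature for feature, count in counts.items() if count > y]


# Counts for Component 1 after the additional hit on Feature 1_1.
component_1 = {"Feature 1_1": 2, "Feature 1_2": 5}
chosen = select_fingerprints(component_1, matched_sites=8)
print(chosen)  # ['Feature 1_2']
```

Tying Y to the number of matched websites keeps the bar proportionate: a feature seen on 5 of 8 websites passes, whereas the same Count against hundreds of websites would not.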
Through the above steps, a reliable component fingerprint library is ultimately obtained automatically. The fingerprint of the example Component 1, "Hash value fingerprint: Hash_A; key field feature fingerprint: Feature 1_2", can be used for actual Web component fingerprinting.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (4)

1. A method for automatically discovering and mining fingerprints of Web components, characterized in that the method comprises the following steps:
1) Establishing a webpage database, a website static file feature library (storing digital digests, i.e. Hash values), a component source code file feature library, a website_component association library, a component fingerprint library, and a component fingerprint library to be selected;
2) Collecting website webpage data under different domain names and storing it in the webpage database;
3) Processing the website data, comprising the following steps:
3.1 Calculating the Hash values of the website's static files (JavaScript (JS) files, Cascading Style Sheets (CSS) files and pictures), together with its special file path features and keyword features; one website has several static file digital digests (Hash values), special file paths and keyword features;
3.2 Storing each calculated Hash value in the website static file feature library; if the Hash value already exists in the library, increasing its Count by 1;
4) Calculating the Hash values of the static files (JS files, CSS files and pictures) unique to the source code of each open source component, together with the special file path features and keyword features, and storing the results in the component source code file feature database; one component has several static file Hash values;
5) Comparing and matching the Hash values of the website static files against the Hash values of the component source code files, comprising the following steps:
5.1 Extracting from the website static file feature library one Hash value record with Count > N, where N is any natural number greater than 2;
5.2 Comparing the Hash value extracted in 5.1 with the Hash values in the component source code file feature database in turn; if two Hash values are the same, the match succeeds;
5.3 If the match in step 5.2 succeeds, writing the matched Hash value into the component fingerprint library as a fingerprint of the component; at the same time, labeling the websites that contain this Hash value in the website static file feature library with the component identifier, thereby associating the component with those websites, and writing the association into the website_component association library; then ending this round of matching and extracting the next Hash value with Count > N, until every Hash value with Count > N has been matched against the Hash values in the component source code file feature library; if the match fails, returning to step 5.1;
6) Extracting from the website_component association library all websites associated with a given component, and extracting the corresponding website webpage data from the webpage database;
7) For that component, extracting its special file path features and keyword features from the component source code file feature database, and matching each extracted feature in turn against the extracted website webpage data; if a match succeeds, writing the feature into the component fingerprint library to be selected, and increasing the feature's Count by 1 each time the feature is matched in the webpage data of a different website;
8) Selecting the features whose hit number Count > M from the component fingerprint library to be selected and writing them into the component fingerprint library, where M is any natural number greater than 2;
the component fingerprint mining method is as follows: the special file path features and keyword features in the component's source code file features are matched against a large amount of data from websites that contain the component; if a match succeeds, it is checked whether the feature already exists in the component fingerprint library to be selected; if so, the feature's Count is increased by 1, and if not, the feature is written into the component fingerprint library to be selected;
the special file path features and keyword features are selected as follows: the special file paths and keywords that have been hit many times are selected and written into the component fingerprint library as fingerprints.
2. The method for automatically discovering and mining fingerprints of Web components according to claim 1, wherein: the website static file features comprise the Hash values of the website's static files, namely its JavaScript (JS) files, Cascading Style Sheets (CSS) files and pictures.
3. The method for automatically discovering and mining fingerprints of Web components according to claim 1, wherein: the component source code file features comprise the Hash values of the static files (JS files, CSS files and pictures) unique to the component source code, as well as its unique file path features and keyword features.
4. The method for automatically discovering and mining fingerprints of Web components according to claim 1, wherein: whether the Hash value of a static file is a component fingerprint is judged as follows: the Hash values of the component's static files are compared in turn with the Hash values of static files that appear multiple times across different websites; if two Hash values are the same, that Hash value is judged to be a component fingerprint.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010197426.6A CN111475464B (en) 2020-03-19 2020-03-19 Method for automatically finding and mining fingerprints of Web component

Publications (2)

Publication Number Publication Date
CN111475464A CN111475464A (en) 2020-07-31
CN111475464B CN111475464B (en) 2023-04-25

Family

ID=71747637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010197426.6A Active CN111475464B (en) 2020-03-19 2020-03-19 Method for automatically finding and mining fingerprints of Web component

Country Status (1)

Country Link
CN (1) CN111475464B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131508A (en) * 2020-09-25 2020-12-25 深信服科技股份有限公司 Method, equipment, device and medium for identifying fingerprint of website application framework
CN113946566B (en) * 2021-12-20 2022-03-18 北京大学 Web system fingerprint database construction method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103065095A (en) * 2013-01-29 2013-04-24 四川大学 WEB vulnerability scanning method and vulnerability scanner based on fingerprint recognition technology
CN108628722A (en) * 2018-05-11 2018-10-09 华中科技大学 A kind of distributed Web Component services detection system
CN110489701A (en) * 2019-08-19 2019-11-22 安徽三实信息技术服务有限公司 Extract the method, apparatus and CMS recognition methods of CMS identification feature

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9165124B1 (en) * 2012-02-01 2015-10-20 Convertro, Inc. Systems and methods for identifying a returning web client
US11386181B2 (en) * 2013-03-15 2022-07-12 Webroot, Inc. Detecting a change to the content of information displayed to a user of a website
US10635426B2 (en) * 2017-03-17 2020-04-28 Microsoft Technology Licensing, Llc Runtime deployment of payloads in a cloud service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"一种有效的Web指纹识别方法";闫淑筠 等;《中国科学院大学学报》;20160915;第33卷(第5期);全文 *

Also Published As

Publication number Publication date
CN111475464A (en) 2020-07-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant