CN104063488B - A kind of form feature extracting method of semi-automatic learning type - Google Patents

Publication number
CN104063488B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410317562.9A
Other languages
Chinese (zh)
Other versions
CN104063488A (en)
Inventor
陈超
陈超一
范渊
吴永越
郑学新
姜毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu DBAPPSecurity Co Ltd
Original Assignee
Chengdu DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-07-07
Filing date
2014-07-07
Publication date
2017-09-01
2014-07-07 Application filed by Chengdu DBAPPSecurity Co Ltd
2014-07-07 Priority to CN201410317562.9A
2014-09-24 Publication of CN104063488A
Application granted
2017-09-01 Publication of CN104063488B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/957: Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577: Optimising the visualization of content, e.g. distillation of HTML documents

Abstract

The invention discloses a semi-automatic learning method for form feature extraction, comprising the following steps: (1) start the learning device; (2) enter the location of the markup language file; (3) the learning device loads the markup language file; (4) the markup language set is generated; (5) the learning module is inserted into the markup language file; (6) the form is operated, and the operations are recorded in full to generate feature information; (7) the form structure information is stored in the database; (8) form feature learning is complete. With manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features with integrity, authenticity, and accuracy. Form submission is completed by the learning device, so form feature extraction is unlikely to fail. The method has <input> input boxes wrapped in <form> tags, so that after the browser sends the page-load-complete notification the page satisfies the rules of static scanning and the query proceeds smoothly.

Description

A semi-automatic learning method for form feature extraction
Technical field
The present invention relates to the fields of machine learning, data mining, and online experience, and specifically to a semi-automatic learning method for form feature extraction.
Background art
With the spread and popularization of Internet information technology, accessing websites through a browser to retrieve and exchange information has become one of the essential skills for raising productivity in modern society.
When accessing a website, a user may need to enter information frequently, for example to log in, post comments, or vote. Some information has to be entered over and over: logging in to different websites requires entering different user names and passwords, and each online purchase requires re-entering the user's address, postal code, recipient name, and similar details.
This information may have to be entered often and in large amounts, yet it seldom varies: in online shopping, for example, one's address rarely changes, and one's name even less so. For this reason almost every modern markup language processing unit shell, that is, the human-machine interface of the markup language processing unit such as a browser interface, provides automatic login and automatic form filling, which reduces repetitive human labor and improves efficiency.
If the markup language processing unit shell is to fill data automatically into a form in the markup language processing unit, it must know which form item corresponds to each entry, for example: recipient name corresponds to the 1st input box, recipient address to the 2nd input box, and recipient postal code to the 3rd input box. Under such a rule, the shell must know the structural features of the form before it can fill the data into the correct items.
The Hypertext Markup Language (HTML) proposed by the World Wide Web Consortium, referred to here as the "markup language", is a language standard that lets the Internet use web page files generated from a unified, standardized markup language; such files are referred to as "markup files". HTML provides a series of standard basic components on top of tree-structured tags, so any markup language processing unit that implements the HTML standard remains general-purpose.
When a markup language processing unit loads a website's markup language file and data must be submitted to the website (for example when chatting, posting comments, buying or selling goods, or saving personalized settings), the website must provide a way to collect data from the browser. For this purpose the HTML standard provides the "form" component. A form generally contains the following elements: <form>, which declares a form whose enclosed data can be submitted to the server; <input>, a child node of the <form> tag that declares a single-line text input box and is rendered differently depending on its type attribute, for example <input type=text> is an ordinary input box and <input type=password> is a password input box that hides what is typed; and the submit button, which is actually a value of the type attribute of the <input> tag: when the type attribute of an <input> tag is set to submit, the markup language processing unit renders a button, and when the button is activated the user-entered data of all valid <input> tags inside the <form> tag is submitted to the server.
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing unit sends a notification that a markup file has finished loading, and the page is assumed to contain the elements above, the markup file is analyzed through the interface provided by the markup language processing unit and the features of the form's <form> and <input> tags are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading technology, because dynamic loading causes the following problems:
After the markup language processing unit sends the page-load-complete notification, the markup file contains no login box content; the markup language needed to render the form is in fact still being loaded by JavaScript scripts in the markup file. In other words, the markup language set needed to render the form has not actually finished loading, so form feature extraction fails;
The submit button is not <input type=submit>; it may be any HTML tag with code attached that calls a JavaScript script, and the form is submitted by that script, so form feature extraction fails;
The <input> input boxes may not even be wrapped in a <form> tag. As a result, after the browser sends the page-load-complete notification, the page cannot satisfy the rules of static scanning, and the query fails.
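The three failure cases above can be reproduced with a small static scanner. The following is a minimal sketch in Python using the standard library's `html.parser`; it is an illustration, not the scanner described in the patent. The class name `StaticFormScanner`, the function `static_scan`, the sample pages, and the script functions `buildLoginBox` and `doLogin` mentioned inside them are all hypothetical.

```python
from html.parser import HTMLParser


class StaticFormScanner(HTMLParser):
    """Collects <input> tags that are nested inside a <form> tag."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0   # > 0 while the parser is inside a <form>
        self.features = []    # one attribute dict per wrapped <input>

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            self.features.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def static_scan(markup):
    """Return the attributes of every <input> found inside a <form>."""
    scanner = StaticFormScanner()
    scanner.feed(markup)
    scanner.close()
    return scanner.features


# A classic page: the scan finds both fields and the submit button.
STATIC_PAGE = """
<form action="/login">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="submit" value="Log in">
</form>
"""

# A dynamic page: the form is built later by a script, so the markup
# present at load time contains no <form> or <input> at all.
DYNAMIC_PAGE = """
<div id="login"></div>
<script>buildLoginBox(document.getElementById("login"));</script>
"""

# Inputs that are not wrapped in <form>, with a scripted "button".
UNWRAPPED_PAGE = """
<input type="text" name="user">
<a onclick="doLogin()">Log in</a>
"""

if __name__ == "__main__":
    for name, page in [("static", STATIC_PAGE),
                       ("dynamic", DYNAMIC_PAGE),
                       ("unwrapped", UNWRAPPED_PAGE)]:
        print(name, len(static_scan(page)))
```

Only the first page yields any features; the dynamic and unwrapped pages yield none, which is the gap the semi-automatic method is meant to close.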
Summary of the invention
The object of the present invention is to provide, with manual participation, a semi-automatic learning method for form feature extraction that can extract web form structural features with integrity, authenticity, and accuracy.
The present invention is achieved through the following technical solution: a semi-automatic learning method for form feature extraction, comprising the following steps:
(1) Start the learning device, which has a built-in markup language processing unit;
(2) Enter the location of the markup language file in the address bar;
(3) The learning device loads the markup language file through the built-in browser;
(4) After loading completes, the built-in browser notifies the learning device that the markup language file has finished loading, and the markup language set is generated;
(5) The learning device inserts the learning module into the loaded markup language file;
(6) The form is operated; the learning device records the operations in full and generates the related feature information;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the form structure information in the database;
(8) The whole form feature learning process is complete.
The above method builds a learning device with a built-in markup language processing unit and fixes the markup language; the tag that the markup language processing unit uses to render an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the page has finished loading; that judgment is made by a person.
When the person sees that the markup language processing unit displays a form that needs to be filled in, the markup language set that renders the form structure must already be fully present in the markup language processing unit.
The markup language processing unit is given the ability, whenever any <input> tag is activated, to notify the learning device of the object of the activated <input> tag.
Through the activated <input> tag object, the learning device reads the attributes of that tag.
By traversing the markup language set in the markup language processing unit, the learning device computes the absolute position of the currently activated <input> tag within the markup language set.
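The position calculation can be illustrated with a short sketch. The text does not define "absolute position" precisely; the Python example below assumes it means the index of the tag among all start tags in document order, and it identifies the activated input box by its `name` attribute purely for illustration. `TagCollector`, `absolute_position`, and `PAGE` are hypothetical names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Flattens a markup file into its start tags, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, attribute_dict)

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def absolute_position(markup, activated_name):
    """Return the document-order index of the <input> whose name matches.

    Returns None if no such <input> exists in the markup.
    """
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    for index, (tag, attrs) in enumerate(collector.tags):
        if tag == "input" and attrs.get("name") == activated_name:
            return index
    return None


PAGE = """
<html><body>
  <div><input name="recipient"></div>
  <p>Shipping</p>
  <input name="address">
  <input name="postcode">
</body></html>
"""

if __name__ == "__main__":
    for field in ("recipient", "address", "postcode"):
        print(field, absolute_position(PAGE, field))
```

Because the index is taken over the whole tag sequence rather than over a <form> subtree, the same calculation works for input boxes that are not wrapped in a <form> tag.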
The markup language processing unit is also given the ability, when a form submission event is produced, not to submit to the server but instead to notify the learning device which object produced the form submission event.
In the learning device, the input boxes that need to be filled in are activated one after another. During this process, an input box that has not been activated before is recorded when it is activated; repeated activations are ignored. When the submit button is clicked, a form submission event is produced; after receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
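The recording rule described above (record an input box on first activation, ignore repeats, store everything with the URL when submission is intercepted) can be sketched as follows. This is a minimal Python model under stated assumptions, not the patent's implementation: the class `LearningSession`, its method names, and the use of a plain dictionary as the form feature database are all hypothetical.

```python
class LearningSession:
    """Records each activated input box once, then stores the result on submit."""

    def __init__(self, url, database):
        self.url = url
        self.database = database      # hypothetical store: URL -> learned features
        self.recorded = []            # input boxes in first-activation order
        self.seen_positions = set()   # absolute positions already recorded

    def on_input_activated(self, position, attributes):
        """Record an input box; return False if it was already recorded."""
        if position in self.seen_positions:
            return False              # repeated activation is ignored
        self.seen_positions.add(position)
        self.recorded.append({"position": position,
                              "attributes": dict(attributes)})
        return True

    def on_submit(self, source):
        """Handle the intercepted submission; nothing is sent to a server."""
        self.database[self.url] = {"inputs": list(self.recorded),
                                   "submit": source}
        return self.database[self.url]


if __name__ == "__main__":
    db = {}
    session = LearningSession("http://example.com/login", db)
    session.on_input_activated(3, {"type": "text", "name": "user"})
    session.on_input_activated(5, {"type": "password", "name": "pass"})
    session.on_input_activated(3, {"type": "text", "name": "user"})  # ignored
    session.on_submit("a#login-link")
    print(db)
```

Keying the record on absolute position rather than on attribute values is a design choice of this sketch; it keeps two input boxes with identical attributes distinct.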
Through this component, the learning device can interact with the markup language processing unit, learn web form features, and store them in the form feature database.
Whatever the engine, to deliver its value it must ultimately be integrated into a working environment; engines therefore expose an operation interface through which third-party devices can operate them.
According to the JavaScript language standard, when a click is made in the markup language processing unit with a controller, an onClick event is produced.
According to the JavaScript language standard, when an onClick event is produced, a function can be called and the object that triggered onClick is passed to the function as a parameter, so that JavaScript can operate on the object according to the event.
A JavaScript function is written that keeps traversing the tag objects in the current markup language processing unit and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled.
A second JavaScript function is written that is responsible for collecting the information sent by the onClick handler.
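The two functions described above run as JavaScript inside the page. To keep the example self-contained and runnable without a browser, the following Python sketch only models their logic on a toy element tree; it is a simulation, not the patent's script. `Element`, `register_handlers`, `on_click`, `collect`, and `collected` are hypothetical names, and `Element` is a stand-in for a tag object rather than a real DOM node.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Tags whose click events the traversal function hooks, as listed in the text.
CLICKABLE_TAGS = {"input", "button", "a", "img"}


@dataclass
class Element:
    """Minimal stand-in for a tag object; not a real DOM node."""
    tag: str
    attributes: dict = field(default_factory=dict)
    children: List["Element"] = field(default_factory=list)
    onclick: Optional[Callable[["Element"], None]] = None

    def click(self):
        if self.onclick is not None:
            self.onclick(self)


collected = []  # information gathered from the click handler


def collect(info):
    """Second function: gathers what the click handler reports."""
    collected.append(info)


def on_click(element):
    """Click handler: reports the tag name and attributes of its target."""
    collect({"tag": element.tag, "attributes": dict(element.attributes)})


def register_handlers(root):
    """First function: walks the tree and hooks every clickable tag.

    Returns how many elements were newly hooked, so calling it again after
    controls have been added dynamically only picks up the new ones.
    """
    newly_hooked = 0
    stack = [root]
    while stack:
        element = stack.pop()
        if element.tag in CLICKABLE_TAGS and element.onclick is not on_click:
            element.onclick = on_click
            newly_hooked += 1
        stack.extend(element.children)
    return newly_hooked


if __name__ == "__main__":
    body = Element("body", children=[Element("input", {"name": "user"})])
    print(register_handlers(body))          # hooks the existing input box
    body.children.append(Element("a", {"id": "go"}))
    print(register_handlers(body))          # hooks only the late-added link
```

Running the traversal repeatedly is what lets controls inserted after the first pass receive a handler; the check against the already installed handler keeps repeated passes from registering twice.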
When the markup language processing unit confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing unit itself to register a private JavaScript interface, provided by the learning device, with the markup language processing unit. This private interface lets the JavaScript engine in the markup language processing unit communicate with the learning device; through it, the JavaScript engine in the current markup file can send the collected tag information to the learning device.
Further, the learning device has a built-in physical markup language processing unit.
Further, the learning device has a built-in non-physical markup language processing unit.
Further, the markup language processing unit is provided with an operation interface.
Further, the default markup language of the markup language processing unit is HTML.
Further, the markup language processing unit is the Trident engine, and the operation interface is the WebControl interface. Many mature markup language processing units exist at present, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine, and private physical or virtual engines from other companies in the industry. Different markup language processing units provide corresponding interfaces of many kinds and names; the preferred markup language processing unit here is the Trident engine, whose interface is the corresponding WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language set is JavaScript script content.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) With manual participation, the method of the invention learns the structure of markup language forms semi-automatically and can extract web form structural features with integrity, authenticity, and accuracy;
(2) The submit button used in the method of the invention is <input type=submit>, and form submission is completed by the learning device, so form feature extraction is unlikely to fail;
(3) The method of the invention has <input> input boxes wrapped in <form> tags, so that after the browser sends the page-load-complete notification the page satisfies the rules of static scanning and the query proceeds smoothly.
Brief description of the drawings
Fig. 1 shows the workflow of the markup language processing unit;
Fig. 2 shows the workflow of the learning device with the markup language learning device;
Fig. 3 shows the flow of the markup language set traversal function;
Fig. 4 shows the flow of the "click" event handler function.
Detailed description
The present invention is described in further detail below with reference to an embodiment, but implementations of the invention are not limited to it.
Embodiment:
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing unit sends a notification that a markup file has finished loading, and the page is assumed to contain the elements above, the markup file is analyzed through the interface provided by the markup language processing unit and the features of the form's <form> and <input> tags are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading technology.
This embodiment discloses a semi-automatic learning method for form feature extraction. With manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features with integrity, authenticity, and accuracy. The specific implementation steps are:
(1) Start the learning device; a human-computer interaction interface similar to the IE browser appears;
(2) Enter the markup language file in the address bar to locate the markup language file;
(3) The device loads the markup language file through the built-in IE browser;
(4) After loading completes, the built-in IE browser notifies the learning device that the markup language file has finished loading and that the markup language set has been generated;
(5) The learning device inserts the learning module into the loaded markup language file;
(6) Operate the form, for example fill in content, choose an option, or click the submit button; these operations are recorded in full by the learning device, which also generates the related feature information such as tag name, attributes, and absolute position;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the feature information of the form structure in the database. The whole form feature learning process is complete.
The workflow of the learning device with the markup language learning device is shown in Fig. 2. The default markup language is HTML, and the tag that the markup language processing unit uses to render an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the page has finished loading; that judgment is made by a person.
When the person sees that the markup language processing unit displays a form that needs to be filled in, the markup language set that renders the form structure is already fully present in the markup language processing unit.
The markup language processing unit is given the ability, whenever any <input> tag is activated, to notify the learning device of the object of the activated <input> tag. When parsing the markup language, the markup language processing unit generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of that tag. By traversing the markup language set in the markup language processing unit, the learning device computes the absolute position of the currently activated <input> tag within the markup language set. The markup language processing unit is also given the ability, when a form submission event is produced, not to submit to the server but instead to notify the learning device which object produced the form submission event. In the learning device, the input boxes that need to be filled in are activated one after another; during this process, an input box that has not been activated before is recorded when it is activated, and repeated activations are ignored. When the submit button is clicked in the learning device, a form submission event is produced; after receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing unit chosen is Microsoft's Trident engine, and its corresponding interface is the WebControl interface.
According to the JavaScript language standard, when a click is made in the markup language processing unit with a controller, an onClick event is produced. When an onClick event is produced, a function can be called and the object that triggered onClick is passed to the function as a parameter, so that JavaScript can operate on the object according to the event.
A JavaScript function is written that keeps traversing the tag objects in the current markup language processing unit, as shown in Fig. 3, and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled. A second function is responsible for collecting the information sent by the onClick handler, as shown in Fig. 4.
When the markup language processing unit confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises the DocumentCompleted event. The learning device then uses the interface provided by the markup language processing unit itself to place the function that reads control information into the markup set of the current markup file.
When the markup language processing unit confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing unit itself to register a private JavaScript interface, provided by the learning device, with the markup language processing unit. This private interface lets the JavaScript engine in the markup language processing unit communicate with the learning device; through it, the JavaScript engine in the current markup file can send the collected tag information to the learning device.
After receiving the final tag information, the learning device writes it to the database.
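The final write to the database can be sketched with Python's standard `sqlite3` module. The patent does not specify a database engine or schema, so everything here is an assumption made for illustration: the table name `form_features`, its columns, and the helper functions `open_feature_db`, `store_features`, and `load_features`.

```python
import json
import sqlite3


def open_feature_db(path=":memory:"):
    """Open the database and create the (hypothetical) feature table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS form_features ("
        " url TEXT NOT NULL,"
        " position INTEGER NOT NULL,"
        " tag TEXT NOT NULL,"
        " attributes TEXT NOT NULL,"
        " PRIMARY KEY (url, position))"
    )
    return conn


def store_features(conn, url, records):
    """Write one row per recorded tag; relearning a URL replaces its rows."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM form_features WHERE url = ?", (url,))
        conn.executemany(
            "INSERT INTO form_features (url, position, tag, attributes)"
            " VALUES (?, ?, ?, ?)",
            [
                (url, r["position"], r["tag"],
                 json.dumps(r["attributes"], sort_keys=True))
                for r in records
            ],
        )


def load_features(conn, url):
    """Return the stored records for a URL, ordered by absolute position."""
    rows = conn.execute(
        "SELECT position, tag, attributes FROM form_features"
        " WHERE url = ? ORDER BY position",
        (url,),
    )
    return [
        {"position": p, "tag": t, "attributes": json.loads(a)}
        for p, t, a in rows
    ]


if __name__ == "__main__":
    conn = open_feature_db()
    store_features(conn, "http://example.com/login", [
        {"position": 5, "tag": "input", "attributes": {"name": "pass"}},
        {"position": 3, "tag": "input", "attributes": {"name": "user"}},
    ])
    print(load_features(conn, "http://example.com/login"))
```

Storing the tag name, attributes, and absolute position per URL mirrors the feature information listed in step (6); an auto-fill component could later look up a page by URL and address each input box by its stored position.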
The above is only a preferred embodiment of the present invention and does not limit the invention in any form; any simple modification or equivalent variation made to the above embodiment according to the technical essence of the invention falls within the scope of protection of the invention.

Claims (8)

1. A semi-automatic learning method for form feature extraction, characterized in that it comprises the following steps:
(1) starting a learning device, the learning device having a built-in markup language processing unit;
(2) entering the location of a markup language file in the address bar;
(3) the learning device loading the markup language file through a built-in browser;
(4) after loading completes, the built-in browser notifying the learning device that the markup language file has finished loading, and generating a markup language set;
(5) the learning device inserting a learning module into the loaded markup language file;
(6) operating the form, the learning device recording the operations in full and generating the related feature information;
(7) on receiving the submit button click event, the learning module considering learning complete and storing the form structure information in a database; wherein, in the semi-automatic learning process, the machine does not need to recognize when the page has finished loading, that judgment being made by a person; when the markup language processing unit displays a form that needs to be filled in, the markup language set that renders the form structure is already fully present in the markup language processing unit; the markup language processing unit is given the ability, whenever any <input> tag is activated, to notify the learning device of the object of the activated <input> tag; the markup language processing unit, when parsing the markup language, generates a unique corresponding entry for each tag; the learning device reads the attributes of the tag through the activated <input> tag object; the learning device computes the absolute position of the currently activated <input> tag within the markup language set by traversing the markup language set in the markup language processing unit; the markup language processing unit is given the ability, when a form submission event is produced, not to submit to the server but instead to notify the learning device which object produced the form submission event; in the learning device, the input boxes that need to be filled in are activated one after another, an input box that has not been activated before being recorded when it is activated and repeated activations being ignored; and in the learning device, the submit button is clicked, a form submission event is produced, and after receiving the event the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in a form feature database;
(8) the whole form feature learning process being complete.
2. The semi-automatic learning method for form feature extraction according to claim 1, characterized in that the learning device has a built-in physical markup language processing unit.
3. The semi-automatic learning method for form feature extraction according to claim 1, characterized in that the learning device has a built-in non-physical markup language processing unit.
4. The semi-automatic learning method for form feature extraction according to claim 2 or 3, characterized in that the markup language processing unit is provided with an operation interface.
5. The semi-automatic learning method for form feature extraction according to claim 2 or 3, characterized in that the default markup language of the markup language processing unit is HTML.
6. The semi-automatic learning method for form feature extraction according to claim 4, characterized in that the markup language processing unit is the Trident engine and the operation interface is the WebControl interface.
7. The semi-automatic learning method for form feature extraction according to claim 1, characterized in that the built-in browser is the IE browser.
8. The semi-automatic learning method for form feature extraction according to claim 1, characterized in that the markup language set is JavaScript script content.
CN201410317562.9A 2014-07-07 2014-07-07 A kind of form feature extracting method of semi-automatic learning type Active CN104063488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410317562.9A CN104063488B (en) 2014-07-07 2014-07-07 A kind of form feature extracting method of semi-automatic learning type

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410317562.9A CN104063488B (en) 2014-07-07 2014-07-07 A kind of form feature extracting method of semi-automatic learning type

Publications (2)

Publication Number Publication Date
CN104063488A CN104063488A (en) 2014-09-24
CN104063488B true CN104063488B (en) 2017-09-01

Family

ID=51551202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410317562.9A Active CN104063488B (en) 2014-07-07 2014-07-07 A kind of form feature extracting method of semi-automatic learning type

Country Status (1)

Country Link
CN (1) CN104063488B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109445654B (en) * 2018-09-28 2022-02-08 成都安恒信息技术有限公司 Method for automatically filling gaps in graphical interface program
CN112836150A (en) * 2021-02-03 2021-05-25 捷玛计算机信息技术(上海)股份有限公司 Identification method, system, equipment and medium for tracing code of medicine

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102681994A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Webpage information extracting method and system
CN103443786A (en) * 2011-03-15 2013-12-11 高通股份有限公司 Machine learning method to identify independent tasks for parallel layout in web browsers
CN103440198A (en) * 2013-08-27 2013-12-11 星云融创(北京)信息技术有限公司 Method for calibrating form
CN103514292A (en) * 2013-10-09 2014-01-15 南京大学 Webpage data extraction method based on semi-supervised learning of small sample
CN103559234A (en) * 2013-10-24 2014-02-05 北京邮电大学 System and method for automated semantic annotation of RESTful Web services
CN103699683A (en) * 2014-01-02 2014-04-02 国家电网公司 Data processing method and data processing device
CN103793282A (en) * 2012-11-02 2014-05-14 阿里巴巴集团控股有限公司 Browser and tab ending method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9460064B2 (en) * 2006-05-18 2016-10-04 Oracle International Corporation Efficient piece-wise updates of binary encoded XML data
US20170147577A9 (en) * 2009-09-30 2017-05-25 Gennady LAPIR Method and system for extraction

Also Published As

Publication number Publication date
CN104063488A (en) 2014-09-24

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant