CN104063488B - A kind of form feature extracting method of semi-automatic learning type - Google Patents
A kind of form feature extracting method of semi-automatic learning type
- Publication number
- CN104063488B (application number CN201410317562.9A; application publication CN104063488A)
- Authority
- CN
- China
- Prior art keywords
- markup language
- learning device
- language processing
- semi
- processing unit
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
- G06F16/9577—Optimising the visualization of content, e.g. distillation of HTML documents
Abstract
The invention discloses a semi-automatic learning method for extracting form features, comprising the following steps: (1) start the learning device; (2) enter the location of the markup language file; (3) the learning device loads the markup language file; (4) the markup language aggregate is generated; (5) the learning module is inserted into the markup language file; (6) the form is operated, the operations are recorded in full, and feature information is generated; (7) the form structure information is stored in a database; (8) form feature learning is complete. With manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate. Form submission is completed by the learning device, so form feature extraction is unlikely to fail. The method causes <input> input boxes to be wrapped by a <form> tag, so that the rules of static scanning are satisfied after the browser sends the page-loaded notification and queries proceed smoothly.
Description
Technical field
The present invention relates to the fields of machine learning, data mining, and online experience, and specifically to a semi-automatic learning method for extracting form features.
Background technology
With the spread and popularization of Internet information technology, using a browser to access websites in order to retrieve and exchange information has become one of the essential skills for improving productivity in modern society.
When accessing a website to retrieve information, a user may need to enter information into the site frequently, for example to log in, post a comment, or vote. Some of this information must be entered again and again. When logging in, different websites require different user names or passwords; when shopping online, each purchase requires the buyer to enter the delivery address, postal code, recipient name, and similar details once more.
This information may need to be entered frequently and in large volume, and it rarely changes: in online shopping, a person's address seldom changes, and their name even less so. For this reason the shell of almost every modern markup language processing device, that is, its human-machine interface (a browser interface, for example), provides automatic login and automatic form filling, which reduce repetitive human labor and improve productivity.
If the shell of the markup language processing device is to fill data automatically into a form displayed in the device, it must know which form item corresponds to each data entry, for example: the recipient name corresponds to the 1st input box, the recipient address to the 2nd input box, and the recipient postal code to the 3rd input box. Under such a rule, the structural features of the form must be known before data can be filled into the correct items.
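The correspondence between data entries and input boxes described above can be illustrated with a minimal sketch. It is not taken from the patent: the `formFeature` record, the `autofill` function, the field names, and the sample values are all hypothetical, and plain objects stand in for real input boxes.

```javascript
// Hypothetical form-feature record: maps each data entry to the 1-based
// ordinal position of the input box that should receive it.
const formFeature = {
  recipientName: 1,
  recipientAddress: 2,
  recipientPostcode: 3,
};

// Fill an ordered list of input boxes from user data using the feature map.
// `inputs` is a plain array standing in for the input boxes of a form.
function autofill(inputs, feature, userData) {
  for (const [entry, position] of Object.entries(feature)) {
    const box = inputs[position - 1];
    // Skip entries that have no matching box or no stored value.
    if (box === undefined || !(entry in userData)) continue;
    box.value = userData[entry];
  }
  return inputs;
}

const inputs = [{ value: "" }, { value: "" }, { value: "" }];
autofill(inputs, formFeature, {
  recipientName: "Li Lei",
  recipientAddress: "1 Example Road",
  recipientPostcode: "610000",
});
console.log(inputs.map((box) => box.value));
```

Without a correct feature record the values would land in the wrong boxes, which is why the structure of the form has to be learned first.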
HTML (HyperText Markup Language), the language standard proposed by the World Wide Web Consortium and referred to here as the "markup language", lets the Internet produce web page files, referred to here as "markup files", in a unified, standardized, tag-based language. HTML builds on tree-structured tags and provides a series of standard basic components; as long as a markup language processing device implements the HTML standard, it remains universally compatible.
When a website's markup language file is loaded in a markup language processing device and data must be submitted to the website, for example to chat, post comments, buy or sell goods, or save personalized settings, the website must provide a way to collect data from the browser. For this purpose the HTML standard provides the "form" component. A form generally contains the following elements:
<form>: declares a form; the data inside it can be submitted to the server;
<input>: a child node of the <form> tag that declares a single-line text input box; depending on its type attribute it is rendered in different styles, for example <input type=text> is an ordinary input box and <input type=password> is a password input box that hides what is typed;
Submit button: the submit button is in fact a value of the type attribute of the <input> tag. When the type attribute of an <input> tag is set to submit, a button is displayed in the markup language processing device; when the button is activated, the user-entered data of all valid <input> elements inside the <form> tag is submitted to the server.
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing device sends a notification that the markup file has finished loading, the page is assumed to contain the elements above, and the markup file is analyzed through the interface provided by the markup language processing device to extract the <form> and <input> features of the form. This method falls short in the face of rapidly developing dynamic markup loading techniques, because dynamic loading causes the following problems:
After the markup language processing device sends the page-loaded notification, the markup file does not yet contain the login box, and the markup language needed to render the form is still being loaded by JavaScript in the markup file. In other words, the markup language set needed to render the form has not actually finished loading, so form feature extraction fails;
The submit button is not <input type=submit>; it may be any HTML tag with attached JavaScript code, and the form is submitted by that JavaScript, so form feature extraction fails;
The <input> input boxes may not even be wrapped by a <form> tag. As a result, the page does not satisfy the rules of static scanning after the browser sends the page-loaded notification, and the query fails.
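The failure of static scanning on dynamically built pages can be shown with a short sketch. It assumes a deliberately simplified scanning rule (only <input> nodes nested inside a <form> node count) and uses plain objects of the hypothetical shape `{ tag, attrs, children }` in place of a real document tree; the patent does not specify the scanner at this level of detail.

```javascript
// Static scan under an assumed rule: collect <input> nodes that are
// descendants of a <form> node. Nodes are plain objects: { tag, attrs, children }.
function staticScan(node, insideForm = false, found = []) {
  const inForm = insideForm || node.tag === "form";
  if (node.tag === "input" && inForm) found.push(node);
  for (const child of node.children || []) staticScan(child, inForm, found);
  return found;
}

// A classic page: inputs wrapped by <form>, submit button is <input type=submit>.
const classicPage = { tag: "body", children: [
  { tag: "form", children: [
    { tag: "input", attrs: { type: "text", name: "user" } },
    { tag: "input", attrs: { type: "password", name: "pass" } },
    { tag: "input", attrs: { type: "submit" } },
  ] },
] };

// A dynamic page: inputs are not wrapped by <form>, and the "submit button"
// is an <a> tag whose click is handled by script.
const dynamicPage = { tag: "body", children: [
  { tag: "div", children: [
    { tag: "input", attrs: { type: "text", name: "user" } },
    { tag: "input", attrs: { type: "password", name: "pass" } },
    { tag: "a", attrs: { onclick: "login()" } },
  ] },
] };

console.log(staticScan(classicPage).length); // 3
console.log(staticScan(dynamicPage).length); // 0
```

The second page is perfectly usable by a human, yet the static rule finds nothing in it, which is the gap the semi-automatic approach is meant to close.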
Summary of the invention
The object of the present invention is to provide, with manual participation, a semi-automatic learning method for extracting form features that can extract web form structural features that are complete, authentic, and accurate.
The present invention is realized through the following technical solution: a semi-automatic learning method for extracting form features, comprising the following steps:
(1) Start the learning device, which has a built-in markup language processing device;
(2) Enter the location of the markup language file in the address bar;
(3) The learning device loads the markup language file through the built-in browser;
(4) When loading completes, the built-in browser notifies the learning device that the markup language file has finished loading, and the markup language aggregate is generated;
(5) The learning device inserts the learning module into the loaded markup language file;
(6) The form is operated; the learning device records the operations in full and generates the related feature information;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the form structure information in the database;
(8) The whole form feature learning process is complete.
The above method builds a learning device with a built-in markup language processing device and fixes the markup language; the tag that the markup language processing device uses to render an input box defaults to the <input> tag.
During semi-automatic learning, the machine does not need to detect when the web page has finished loading; that judgment is made by a human.
When the human sees that the markup language processing device is displaying a form that needs to be filled in, the markup language set that renders the form structure must already be completely present in the markup language processing device.
The markup language processing device is configured so that whenever any <input> tag is activated, it notifies the learning device of the activated <input> tag object.
Through the activated <input> tag object, the learning device reads the attributes of that tag.
By traversing the markup language set in the markup language processing device, the learning device computes the absolute position of the currently activated <input> tag within the markup language set.
The markup language processing device is also configured so that when a form submission event occurs, the form is not submitted to the server; instead, the learning device is notified of which object produced the submission event.
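The activation notification, attribute reading, and absolute-position calculation described above might look like the following sketch. The patent does not define "absolute position" precisely; this example assumes it is the zero-based index of the tag in a depth-first, document-order traversal of all tags. The names `walk`, `absolutePosition`, and `onInputActivated` are hypothetical, and plain objects stand in for tag objects.

```javascript
// Depth-first, document-order traversal of a plain-object tag tree.
function* walk(node) {
  yield node;
  for (const child of node.children || []) yield* walk(child);
}

// Assumed definition: the absolute position is the zero-based index of the
// target tag among all tags in document order; -1 if it is not in the tree.
function absolutePosition(root, target) {
  let index = 0;
  for (const node of walk(root)) {
    if (node === target) return index;
    index += 1;
  }
  return -1;
}

// Called when an input box is activated. `notify` stands in for the
// learning device receiving the activated tag object.
function onInputActivated(root, input, notify) {
  notify({
    tagName: input.tag,
    attributes: { ...input.attrs },
    absolutePosition: absolutePosition(root, input),
  });
}

const passwordBox = { tag: "input", attrs: { type: "password", name: "pass" } };
const page = { tag: "body", children: [
  { tag: "div", children: [
    { tag: "input", attrs: { type: "text", name: "user" } },
    passwordBox,
  ] },
  { tag: "a", attrs: { onclick: "login()" } },
] };

const notifications = [];
onInputActivated(page, passwordBox, (info) => notifications.push(info));
console.log(notifications[0].absolutePosition); // 3 (body=0, div=1, user=2, pass=3)
```

A position computed this way does not depend on the input box being wrapped by a <form> tag, so it still identifies the box on pages where static scanning finds nothing.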
In the learning device, the input boxes that need to be filled in are activated one after another. During this process, each activated input box is recorded if it has not been activated before; repeated activations are ignored. Clicking the submit button produces a form submission event. On receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
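The recording and storing behaviour described above can be sketched as a small class. This is an illustration under assumptions, not the patented implementation: `FormLearner`, its method names, the record layout, and the use of a `Map` as the form feature database are all hypothetical, and a plain object with a `preventDefault` method stands in for the submission event.

```javascript
class FormLearner {
  constructor(database) {
    this.database = database; // hypothetical store: Map from URL to feature record
    this.recorded = [];       // input boxes in order of first activation
    this.seen = new Set();    // tag objects that were already activated
  }

  // Record an input box the first time it is activated; ignore repeats.
  activate(input) {
    if (this.seen.has(input)) return false;
    this.seen.add(input);
    this.recorded.push({ tagName: input.tag, attributes: { ...input.attrs } });
    return true;
  }

  // Handle the submission event: keep the form from reaching the server,
  // note which object produced the event, and store the features by URL.
  submit(event, url) {
    event.preventDefault();
    const record = { url, submitter: event.target.tag, inputs: this.recorded };
    this.database.set(url, record);
    return record;
  }
}

const database = new Map();
const learner = new FormLearner(database);
const userBox = { tag: "input", attrs: { type: "text", name: "user" } };
const passBox = { tag: "input", attrs: { type: "password", name: "pass" } };

learner.activate(userBox);
learner.activate(passBox);
learner.activate(userBox); // repeated activation, ignored

const submitEvent = {
  target: { tag: "a" },
  prevented: false,
  preventDefault() { this.prevented = true; },
};
learner.submit(submitEvent, "https://example.com/login");
console.log(database.get("https://example.com/login").inputs.length); // 2
```

Keying the record by URL is what later allows the stored features to be looked up when the same page is visited again.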
Through this component, the learning device can interact with the markup language processing device, learn web form features, and store them in the form feature database.
Whatever the engine, it must ultimately be integrated into an application environment to be of value. An engine therefore exposes an operation interface that lets third-party devices operate it.
According to the JavaScript language standard, when a click is made in the markup language processing device using a controller, an onClick event is produced.
According to the JavaScript language standard, when an onClick event is produced, a function can be called and the object that triggered onClick is passed to the function as a parameter, so that JavaScript code can operate on the object according to the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled.
A second JavaScript function is written that is responsible for collecting the information sent by the onClick handler.
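The two functions described above can be sketched as follows. The sketch runs against plain objects rather than a real document so that it is self-contained; `registerClickHandlers`, `createCollector`, and the `learnerHooked` marker are hypothetical names. One traversal pass is shown as a function; in a real page it would presumably be repeated (on a timer, for instance) so that controls added later are hooked as well.

```javascript
// Tags whose clicks the learning module watches, as listed in the text.
const WATCHED_TAGS = new Set(["input", "button", "a", "img"]);

// Collector: gathers the information sent by the click handler.
function createCollector() {
  const events = [];
  return {
    events,
    collect(info) { events.push(info); },
  };
}

// One traversal pass: attach the click handler to every watched tag that
// has not been hooked yet. Returns how many tags were newly hooked.
function registerClickHandlers(root, collector) {
  let registered = 0;
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (WATCHED_TAGS.has(node.tag) && !node.learnerHooked) {
      node.onclick = () =>
        collector.collect({ tagName: node.tag, attributes: { ...node.attrs } });
      node.learnerHooked = true;
      registered += 1;
    }
    for (const child of node.children || []) stack.push(child);
  }
  return registered;
}

const collector = createCollector();
const page = { tag: "body", children: [{ tag: "input", attrs: { name: "user" } }] };
registerClickHandlers(page, collector);   // hooks the existing input

const lateLink = { tag: "a", attrs: { id: "login" } };
page.children.push(lateLink);             // simulates a dynamically loaded control
registerClickHandlers(page, collector);   // a later pass hooks only the new <a>

lateLink.onclick();                       // simulate a click on the late control
console.log(collector.events);
```

Marking hooked tags keeps repeated passes cheap and avoids registering the same handler twice on one control.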
When the markup language processing device confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing device itself to register a private JavaScript interface, provided by the learning device, with the markup language processing device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine running in the current markup file can send the collected tag information to the learning device.
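The private interface can be pictured as an object that the host places into the script environment once loading is confirmed. The sketch below simulates both sides in one script and is only an assumption about the shape of such a bridge: the property name `learningBridge`, the method `sendTagInfo`, and the choice of JSON strings as the message format are hypothetical and do not come from the patent or from any engine's API.

```javascript
// Learning-device side: builds the private interface exposed to page scripts.
// Messages arrive as JSON strings and are parsed before being handed on.
function createPrivateInterface(onTagInfo) {
  return {
    sendTagInfo(json) { onTagInfo(JSON.parse(json)); },
  };
}

// Stand-in for the script engine's global object inside the loaded page.
const scriptGlobal = {};
const received = [];

// Runs when the engine reports that the markup file has finished loading:
// the private interface is registered under a hypothetical name.
function onDocumentLoaded(global, privateInterface) {
  global.learningBridge = privateInterface;
}
onDocumentLoaded(scriptGlobal, createPrivateInterface((info) => received.push(info)));

// Page-script side: sends collected tag information through the bridge.
function reportTag(global, info) {
  global.learningBridge.sendTagInfo(JSON.stringify(info));
}

reportTag(scriptGlobal, {
  tagName: "input",
  attributes: { type: "text", name: "user" },
  absolutePosition: 2,
});
console.log(received.length); // 1
```

Passing serialized strings rather than live objects keeps the two sides decoupled, which is a common choice when a host application and an embedded script engine exchange data.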
Further, the learning device has a built-in physical markup language processing device.
Further, the learning device has a built-in non-physical markup language processing device.
Further, the markup language processing device is provided with an operation interface.
Further, the default markup language of the markup language processing device is HTML.
Further, the markup language processing device is the Trident engine, and the operation interface is the WebControl interface. Many mature markup language processing devices exist today, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine, and the proprietary physical or virtual engines of other companies in the industry. Different markup language processing devices provide corresponding interfaces under a wide variety of names. The preferred markup language processing device here is the Trident engine, whose interface is the corresponding WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language aggregate is JavaScript script content.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) With manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate;
(2) The submit button used in the method is <input type=submit>, and form submission is completed by the learning device, so form feature extraction is unlikely to fail;
(3) The method causes <input> input boxes to be wrapped by a <form> tag, so that the rules of static scanning are satisfied after the browser sends the page-loaded notification, and queries proceed smoothly.
Brief description of the drawings
Fig. 1 is the workflow of the markup language processing device;
Fig. 2 is the workflow of the learning device with the markup language learning device;
Fig. 3 is the flow of the markup language set traversal function;
Fig. 4 is the flow of the "click" event handling function.
Embodiment
The present invention is described in further detail below with reference to an embodiment, but implementations of the invention are not limited to it.
Embodiment:
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing device sends a notification that the markup file has finished loading, the page is assumed to contain the elements above, and the markup file is analyzed through the interface provided by the markup language processing device to extract the <form> and <input> features of the form. This method falls short in the face of rapidly developing dynamic markup loading techniques.
This embodiment discloses a semi-automatic learning method for extracting form features. With manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate. The specific implementation steps are:
(1) Start the learning device; a human-machine interface resembling the IE browser appears;
(2) Enter the markup language file in the address bar to locate the markup language file;
(3) The device loads the markup language file through the built-in IE browser;
(4) When loading completes, the built-in IE browser notifies the learning device that the markup language file has finished loading, and the markup language aggregate has been generated;
(5) The learning device inserts the learning module into the loaded markup language file;
(6) Operate the form, for example fill in content, choose an option, or click the submit button; the learning device records these operations in full and generates the related feature information, such as tag name, attributes, and absolute position;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the feature information of the form structure in the database. The whole form feature learning process is complete.
The workflow of the learning device with the markup language learning device is shown in Fig. 2. The default markup language is HTML, and the tag that the markup language processing device uses to render an input box defaults to the <input> tag.
During semi-automatic learning, the machine does not need to detect when the web page has finished loading; that judgment is made by a human.
When the human sees that the markup language processing device is displaying a form that needs to be filled in, the markup language set that renders the form structure is already completely present in the markup language processing device.
The markup language processing device is configured so that whenever any <input> tag is activated, it notifies the learning device of the activated <input> tag object. When parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of that tag. By traversing the markup language set in the markup language processing device, the learning device computes the absolute position of the currently activated <input> tag within the markup language set. The markup language processing device is also configured so that when a form submission event occurs, the form is not submitted to the server; instead, the learning device is notified of which object produced the submission event. In the learning device, the input boxes that need to be filled in are activated one after another; each activated input box is recorded if it has not been activated before, and repeated activations are ignored. Clicking the submit button in the learning device produces a form submission event. On receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing device chosen is Microsoft's Trident engine, and its corresponding interface is the WebControl interface.
According to the JavaScript language standard, when a click is made in the markup language processing device using a controller, an onClick event is produced. When an onClick event is produced, a function can be called and the object that triggered onClick is passed to the function as a parameter, so that JavaScript code can operate on the object according to the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device, as shown in Fig. 3, and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled. The collection of the information sent by the onClick handler is shown in Fig. 4.
When the markup language processing device confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises the DocumentCompleted event. The learning device then uses the interface provided by the markup language processing device itself to place the function that reads control information into the tag set of the current markup file.
When the markup language processing device confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing device itself to register a private JavaScript interface, provided by the learning device, with the markup language processing device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine running in the current markup file can send the collected tag information to the learning device.
After receiving the final tag information, the learning device writes it to the database.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Any simple modification or equivalent variation made to the above embodiment in accordance with the technical essence of the invention falls within the scope of protection of the invention.
Claims (8)
1. A semi-automatic learning method for extracting form features, characterized by comprising the following steps:
(1) start the learning device, which has a built-in markup language processing device;
(2) enter the location of the markup language file in the address bar;
(3) the learning device loads the markup language file through the built-in browser;
(4) when loading completes, the built-in browser notifies the learning device that the markup language file has finished loading, and the markup language aggregate is generated;
(5) the learning device inserts the learning module into the loaded markup language file;
(6) the form is operated; the learning device records the operations in full and generates the related feature information;
(7) on receiving the submit button click event, the learning module considers learning complete and stores the form structure information in the database;
during semi-automatic learning, the machine does not need to detect when the web page has finished loading, and that judgment is made by a human; when the markup language processing device displays a form that needs to be filled in, the markup language set that renders the form structure is already completely present in the markup language processing device; the markup language processing device is configured so that whenever any <input> tag is activated, it notifies the learning device of the activated <input> tag object; when parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag; through the activated <input> tag object, the learning device reads the attributes of that tag; by traversing the markup language set in the markup language processing device, the learning device computes the absolute position of the currently activated <input> tag within the markup language set; the markup language processing device is configured so that when a form submission event occurs, the form is not submitted to the server, and instead the learning device is notified of which object produced the submission event; in the learning device, the input boxes that need to be filled in are activated one after another, each activated input box is recorded if it has not been activated before, and repeated activations are ignored; clicking the submit button in the learning device produces a form submission event, and on receiving the event the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database;
(8) the whole form feature learning process is complete.
2. The semi-automatic learning method for extracting form features according to claim 1, characterized in that the learning device has a built-in physical markup language processing device.
3. The semi-automatic learning method for extracting form features according to claim 1, characterized in that the learning device has a built-in non-physical markup language processing device.
4. The semi-automatic learning method for extracting form features according to claim 2 or 3, characterized in that the markup language processing device is provided with an operation interface.
5. The semi-automatic learning method for extracting form features according to claim 2 or 3, characterized in that the default markup language of the markup language processing device is HTML.
6. The semi-automatic learning method for extracting form features according to claim 4, characterized in that the markup language processing device is the Trident engine and the operation interface is the WebControl interface.
7. The semi-automatic learning method for extracting form features according to claim 1, characterized in that the built-in browser is the IE browser.
8. The semi-automatic learning method for extracting form features according to claim 1, characterized in that the markup language aggregate is JavaScript script content.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410317562.9A CN104063488B (en) | 2014-07-07 | 2014-07-07 | A kind of form feature extracting method of semi-automatic learning type |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410317562.9A CN104063488B (en) | 2014-07-07 | 2014-07-07 | A kind of form feature extracting method of semi-automatic learning type |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104063488A CN104063488A (en) | 2014-09-24 |
CN104063488B true CN104063488B (en) | 2017-09-01 |
Family
ID=51551202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410317562.9A Active CN104063488B (en) | 2014-07-07 | 2014-07-07 | A kind of form feature extracting method of semi-automatic learning type |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104063488B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109445654B (en) * | 2018-09-28 | 2022-02-08 | 成都安恒信息技术有限公司 | Method for automatically filling gaps in graphical interface program |
CN112836150A (en) * | 2021-02-03 | 2021-05-25 | 捷玛计算机信息技术(上海)股份有限公司 | Identification method, system, equipment and medium for tracing code of medicine |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102681994A (en) * | 2011-03-07 | 2012-09-19 | 北京百度网讯科技有限公司 | Webpage information extracting method and system |
CN103443786A (en) * | 2011-03-15 | 2013-12-11 | 高通股份有限公司 | Machine learning method to identify independent tasks for parallel layout in web browsers |
CN103440198A (en) * | 2013-08-27 | 2013-12-11 | 星云融创(北京)信息技术有限公司 | Method for calibrating form |
CN103514292A (en) * | 2013-10-09 | 2014-01-15 | 南京大学 | Webpage data extraction method based on semi-supervised learning of small sample |
CN103559234A (en) * | 2013-10-24 | 2014-02-05 | 北京邮电大学 | System and method for automated semantic annotation of RESTful Web services |
CN103699683A (en) * | 2014-01-02 | 2014-04-02 | 国家电网公司 | Data processing method and data processing device |
CN103793282A (en) * | 2012-11-02 | 2014-05-14 | 阿里巴巴集团控股有限公司 | Browser and tab ending method thereof |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9460064B2 (en) * | 2006-05-18 | 2016-10-04 | Oracle International Corporation | Efficient piece-wise updates of binary encoded XML data |
US20170147577A9 (en) * | 2009-09-30 | 2017-05-25 | Gennady LAPIR | Method and system for extraction |
Also Published As
Publication number | Publication date |
---|---|
CN104063488A (en) | 2014-09-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11294968B2 (en) | Combining website characteristics in an automatically generated website | |
CN101211364B (en) | Method and system for social bookmarking of resources exposed in web pages | |
US8468145B2 (en) | Indexing of URLs with fragments | |
CN104158836A (en) | Method for rendering mobile application interface through data | |
US20150302110A1 (en) | Decoupling front end and back end pages using tags | |
CN106598991A (en) | Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode | |
CN104881488A (en) | Relational table-based extraction method of configurable information | |
CN102831345A (en) | Injection point extracting method in SQL (Structured Query Language) injection vulnerability detection | |
CN101261669A (en) | A method for visual validation system based on mouse operation | |
CN106570750A (en) | Browser plug-in-based automatic tax declaration method, system and browser plug-in | |
CN102523106A (en) | Video website user behavior analysis system based on Flex RIA (Rich Internet Applications) technology | |
CN108804469A (en) | A kind of web page identification method and electronic equipment | |
JP2017027208A (en) | Dialogue information providing system, information processing unit and program | |
JP4460620B2 (en) | Information service providing method and server | |
CN104063488B (en) | A kind of form feature extracting method of semi-automatic learning type | |
CN104598348B (en) | A kind of method and system of the long-range external system interface performance of analysis in real time | |
CN114398138A (en) | Interface generation method and device, computer equipment and storage medium | |
CN104156394B (en) | mobile page creation system and method | |
CN109240664A (en) | A kind of method and terminal acquiring user behavior information | |
WO2023155274A1 (en) | Recruitment information publishing method and apparatus based on rpa and ai | |
CN104471531A (en) | Capturing an application state in a conversation | |
JP5497925B2 (en) | Content management apparatus, content management method and program | |
CN116166533A (en) | Interface testing method, device, terminal equipment and storage medium | |
CN105874470A (en) | Interactive optical codes | |
CN104866532B (en) | A kind of method and apparatus for the data search under semiclosed data environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |