CN109597927A - Method and system for extracting page information from bidding-related web pages

Info

Publication number
CN109597927A
Authority
CN
China
Prior art keywords
information
user
bid
bidding
web site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811481859.3A
Other languages
Chinese (zh)
Other versions
CN109597927B (en)
Inventor
李正军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guiyang High-Tech Digital Communication Information Co Ltd
Original Assignee
Guiyang High-Tech Digital Communication Information Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guiyang High-Tech Digital Communication Information Co Ltd
Priority to CN201811481859.3A
Publication of CN109597927A
Application granted
Publication of CN109597927B
Legal status: Active

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to the field of network information acquisition, and in particular to a method and system for extracting page information from bidding-related web pages. The method comprises the following steps: S1: automatically locating, by keywords related to tendering and bid awards, the position nodes in a web page where the relevant information resides; S2: finding the parent node shared by the keywords from the position nodes obtained; S3: determining whether the parent node obtained has already been collected and, if it has not, determining how the content under the parent node is arranged and then crawling the information according to that arrangement; S4: storing and displaying the crawled tender information and bid-award information. The scheme crawls tender information and bid-award information automatically, so that bidders can obtain both in a timely manner.

Description

Method and system for extracting page information from bidding-related web pages
Technical field
The present invention relates to the field of network information acquisition, and in particular to a method and system for extracting page information from bidding-related web pages.
Background art
Tendering and bidding is a mode of transaction used under market-economy conditions for bulk commodity trading, for contracting out and undertaking engineering construction projects, and for procuring and supplying services. In this mode, the purchasing party of a procurement project (covering the purchase of goods, the contracting of works, and the procurement of services) usually acts as the tenderer. By publishing a tender notice, or by sending invitations to tender to a certain number of specific suppliers and contractors, the tenderer announces the procurement and states its conditions: the nature of the items to be procured, their quantity, quality, and technical requirements, the delivery date, completion time, or service period, and the qualification requirements for suppliers and contractors. In doing so it signals its intention to select the supplier or contractor best able to meet the procurement requirements and to sign a procurement contract with it. Each interested supplier of the goods, works, or services then submits a quotation and its responses to the other requirements of the tender, and thereby takes part in the bidding competition. After examining and comparing the quotations and other conditions of all bidders, the tenderer selects the best one as the winning bidder and signs the procurement contract with it.
The development of information technology has changed the bidding field. Bidders, who once obtained project tender information mainly from newspapers and periodicals, now look on internet sites for tenders that suit them. One approach is to log in to the bidding websites of each region and then search and check the needed information manually, one item at a time. A more efficient approach is to log in to a few large bidding-information sites and find the needed information by full-text search. Either way, the whole process is time-consuming and laborious, and information is often obtained too late. In addition, once tender information or bid-award information is published, bidders must click through to view it; if many bidders do so at once, the website of the tendering enterprise may crash, and the bid-award information cannot be obtained in time.
Summary of the invention
The object of the present invention is to provide a method for extracting page information from bidding-related web pages, so as to solve the problem that existing bidders find tender information or bid-award information too late.
The basic scheme provided by the invention is a method for extracting page information from bidding-related web pages, comprising the following steps:
S1: automatically locating, by keywords related to tendering and bid awards, the position nodes in the web page where the relevant information resides;
S2: finding the parent node shared by the keywords from the position nodes obtained;
S3: determining whether the parent node obtained has already been collected; if it has not, determining how the content under the parent node is arranged and then crawling the information according to that arrangement;
S4: storing and displaying the crawled tender information and bid-award information.
The advantages of the present invention are:
1. When a bidder follows the tender information and bid-award information of many enterprise websites, the bidder does not need to visit each enterprise's publishing website; all the tender information and bid-award information needed can be obtained through this scheme.
2. If the website on which a tendering enterprise publishes its information has many followers, the page may crash or fail to load after the enterprise publishes. With this scheme the bidder does not need to enter that website at all; tender information or bid-award information is obtained automatically by this method, which is more convenient than obtaining it manually.
Further, in steps S1-S4, the crawling of tender information and bid-award information is carried out by crawling sub-servers that the main server assigns according to an allocation rule.
Because many enterprise websites need to be crawled, crawling tender information and bid-award information involves many sites and correspondingly many crawling sub-servers. The main server assigns the sub-servers to their crawling work according to the allocation rule, which prevents information from being missed or crawled twice.
Further, when the allocation rule is generated, the update time of the site information on each enterprise website is obtained first. The enterprise websites are then sorted chronologically by update time and, at the same time, ranked by average daily visitor count. If several enterprise websites update at the same time, the crawl instruction is executed first for the website with the higher average daily visitor count; if the update times differ, crawl instructions are executed in order of update time. Each crawling sub-server is scheduled to run in order of the execution time of its crawl instruction.
Update times may differ between enterprise websites, so the update time is taken as one factor of the allocation rule, which allows updated information to be obtained promptly. Average daily visitor counts also differ between enterprise websites, and updates on a heavily visited website are harder to obtain. The average daily visitor count is therefore taken as another factor: when several enterprise websites update at the same time, the one with more average daily visitors is crawled first, so that its updates are not obtained too late.
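The allocation rule described above amounts to a two-key sort. The following is a minimal Python sketch of that ordering, not the patented implementation; the `Site` class, the field names, and the sample sites are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Site:
    # Hypothetical record for one enterprise website.
    name: str
    update_time: time        # time of day at which the site publishes updates
    avg_daily_visitors: int  # average daily visitor count


def crawl_order(sites):
    """Earlier update time first; equal update times go to the busier site."""
    return sorted(sites, key=lambda s: (s.update_time, -s.avg_daily_visitors))


# Made-up sample data.
sites = [
    Site("site-a", time(9, 0), 1200),
    Site("site-b", time(8, 0), 300),
    Site("site-c", time(9, 0), 4500),
]
order = [s.name for s in crawl_order(sites)]
print(order)
```

Negating the visitor count in the sort key keeps the whole rule in a single ascending sort, so the tie-break cannot drift out of step with the primary key.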
Further, in step S3, after the tender information and bid-award information have been crawled, a random selection of the crawled information is checked for correctness.
Checking a random selection of the crawled information gives a preliminary estimate of the accuracy of the crawl and helps system testers optimize the crawling method.
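The spot check can be sketched with the Python standard library as follows. This is an illustrative sketch under assumptions: the function names, the `seed` parameter, and the shape of the review result (pairs of record and verdict) are not specified in the source.

```python
import random


def sample_for_review(records, k, seed=None):
    """Pick up to k crawled records at random for manual checking."""
    rng = random.Random(seed)  # a seed makes a review batch reproducible
    return rng.sample(records, min(k, len(records)))


def accuracy(reviewed):
    """reviewed: list of (record, is_correct) pairs returned by the reviewer."""
    if not reviewed:
        return None
    return sum(1 for _, ok in reviewed if ok) / len(reviewed)


# Made-up sample data.
crawled = [f"notice-{i}" for i in range(20)]
batch = sample_for_review(crawled, 5, seed=1)
```

The resulting accuracy figure is only a preliminary estimate, which matches the role the text gives it.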
In addition, for the above method, a system for extracting page information from bidding-related web pages that uses the method is also provided, comprising: a user terminal, a main server, and multiple crawling sub-servers;
The user terminal is used by users to register, log in, and follow and subscribe to tender information and bid-award information;
The main server generates the allocation rule for the crawling sub-servers and then assigns them, according to that rule, to crawl the tender information and bid-award information of the corresponding enterprise websites.
The system crawls tender information and bid-award information automatically. Bidders who need to view the tender information and bid-award information of different enterprises can view it all in one place through the system instead of searching each enterprise website, which makes it easy to use.
Further, the main server includes a user classification and restriction module, which assigns permissions to registered users by dividing them into ordinary users, documentation staff users, and system tester users. An ordinary user can read the information after purchasing a membership; a documentation staff user can both read and edit the information; a system tester user can read and edit the information and carry out software testing.
Dividing the permissions of registered users makes the system easier to manage.
Brief description of the drawings
Fig. 1 is a logic block diagram of the system for extracting page information from bidding-related web pages in Embodiment One of the present invention.
Detailed description of the embodiments
The invention is described further below through specific embodiments:
As shown in Fig. 1, the system for extracting page information from bidding-related web pages comprises an information management subsystem and an allocation model generation subsystem. The information management subsystem comprises a user terminal, a main server, and multiple crawling sub-servers; the user terminal and the crawling sub-servers communicate wirelessly with the main server through a wireless communication module.
I. User terminal
Login and registration module: used by different users to register or log in with their registration or login information. Users comprise ordinary users and management users; management users comprise system tester users and documentation staff users.
Account settings module: used by users to fill in and set their personal information.
Membership purchase module: used by ordinary users to purchase a membership.
Settings module: used by users to change their password and submit problem feedback.
Tender information search module: used by users to search for and view tender information.
Tender information viewing module: used to view, follow, and subscribe to different types of tender information.
Information verification module: used by documentation staff users to check the tender information and bid-award information sent by the main server.
II. Main server
Database: stores the parent nodes crawled by the crawling sub-servers.
User classification and restriction module: assigns permissions to the different registered users. An ordinary user can read the information after purchasing a membership; a documentation staff user can both read and edit the information; a system tester user can read and edit the information and carry out software testing.
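The three user classes can be pictured as a small permission table. The sketch below is one possible encoding in Python; the role strings, the `Permission` flag names, and the `is_member` parameter are hypothetical labels, not identifiers from the patent.

```python
from enum import Flag, auto


class Permission(Flag):
    NONE = 0
    READ = auto()
    EDIT = auto()
    TEST = auto()


def permissions_for(role, is_member=False):
    """Map a user class to its permissions, following the description above."""
    if role == "ordinary":
        # Ordinary users may read only after purchasing a membership.
        return Permission.READ if is_member else Permission.NONE
    if role == "documentation":
        return Permission.READ | Permission.EDIT
    if role == "tester":
        return Permission.READ | Permission.EDIT | Permission.TEST
    raise ValueError(f"unknown role: {role}")


can_test = Permission.TEST in permissions_for("tester")
```

A flag enumeration keeps each class's rights in one value, so a permission check is a single membership test.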
In this embodiment the allocation model generation subsystem is located in the main server and includes a sub-server allocation module. The sub-server allocation module generates the allocation rule for the crawling sub-servers and, according to that rule, dynamically assigns different crawling sub-servers to execute crawl instructions. When the allocation rule is generated, the update time of the site information on each enterprise website is obtained first. The enterprise websites are then sorted chronologically by update time and, at the same time, ranked by average daily visitor count. If several enterprise websites update at the same time, the crawl instruction is executed first for the website with the higher average daily visitor count; if the update times differ, crawl instructions are executed in order of update time. Each crawling sub-server is scheduled to run in order of the execution time of its crawl instruction.
Extraction verification module: receives the bid-award information or tender information from the key-value judgment module, then randomly selects a preset number of items of tender information and bid-award information and sends them to the user terminal of a documentation staff user.
III. Crawling sub-servers
Information crawling module: crawls the tender and bid-award information of websites as assigned by the main server. When crawling a website, it loads the HTML page containing tender and bid-award information, then uses keywords such as "tender" and "bid award" to locate automatically the position nodes in the page where the relevant information resides. From the position nodes obtained, it then searches for the nearest parent node shared by the keywords. If no nearest shared parent node is found, for example because the node initially obtained is already the home page, no further search is made and the node initially obtained is used as the parent node.
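One way to picture the search for the nearest shared parent node is to record, for every text node that contains a keyword, the chain of open elements above it, and then take the deepest element common to all chains. The sketch below does this with Python's standard `html.parser`. It is an illustration under assumptions: the class and function names, the English keywords, and the sample HTML are invented, and the fallback to the first node found is one reading of the rule stated above.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag and so must not be pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class KeywordLocator(HTMLParser):
    def __init__(self, keywords):
        super().__init__()
        self.keywords = keywords
        self.stack = []    # open elements as (node_id, tag, attrs)
        self.hits = []     # ancestor chain of each keyword-bearing text node
        self.next_id = 0   # unique id so sibling tags are told apart

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append((self.next_id, tag, dict(attrs)))
            self.next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][1] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if any(k in data for k in self.keywords):
            self.hits.append(list(self.stack))


def nearest_common_parent(html, keywords):
    """Return (tag, attrs) of the deepest element shared by all hits."""
    parser = KeywordLocator(keywords)
    parser.feed(html)
    parser.close()
    if not parser.hits:
        return None
    common = parser.hits[0]
    for path in parser.hits[1:]:
        n = 0
        while n < min(len(common), len(path)) and common[n][0] == path[n][0]:
            n += 1
        common = common[:n]
    # No shared ancestor: fall back to the node found first.
    chain = common or parser.hits[0]
    return (chain[-1][1], chain[-1][2]) if chain else None


page = """<html><body><div id="nav"><a>Home</a></div>
<ul id="notices"><li><a>Tender notice: road works</a></li>
<li><a>Bid award: bridge repair</a></li></ul></body></html>"""
parent = nearest_common_parent(page, ["Tender", "Bid award"])
```

On the sample page the two keyword hits sit in different list items, so the shared parent is the enclosing list rather than either item.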
Key-value judgment module: determines whether the parent node crawled by the information crawling module is already stored in the database. If it is, the tender information or bid-award information corresponding to the parent node is obtained. If it is not, the module determines whether the content under the parent node is arranged vertically or horizontally: if horizontally, the corresponding tender information or bid-award information is read horizontally; if vertically, it is read vertically. After obtaining the tender information or bid-award information, the key-value judgment module sends it to the main server.
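The source does not say how the arrangement is detected, so the following Python sketch substitutes a simple heuristic and should be read as an assumption: the content under the parent node is taken to be a grid of cells, and the arrangement is "horizontal" when the known field labels sit in the first row and "vertical" when they sit in the first column. The function names, the label set, the sample values, and the in-memory dictionary standing in for the database are all hypothetical.

```python
import hashlib


def fingerprint(text):
    """Stable key for a parent node, derived from its text content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_arrangement(grid, labels):
    # Assumed heuristic: count known labels in the first row and first column.
    first_row = sum(cell in labels for cell in grid[0])
    first_col = sum(row[0] in labels for row in grid)
    return "horizontal" if first_row >= first_col else "vertical"


def extract_records(grid, labels):
    if detect_arrangement(grid, labels) == "vertical":
        grid = [list(col) for col in zip(*grid)]  # transpose: labels to row 0
    header, *rows = grid
    return [dict(zip(header, row)) for row in rows]


def collect(text, grid, labels, store):
    """Return stored records if the node was seen before, else extract them."""
    key = fingerprint(text)
    if key in store:
        return store[key]
    store[key] = extract_records(grid, labels)
    return store[key]


# Made-up sample data: the same notices laid out in both arrangements.
labels = {"Project", "Deadline"}
rows = [["Project", "Deadline"], ["Road works", "10 Jan"], ["Bridge repair", "1 Feb"]]
cols = [list(c) for c in zip(*rows)]
store = {}
from_rows = collect("page-1 text", rows, labels, store)
from_cols = collect("page-2 text", cols, labels, store)
```

Both layouts yield the same list of records, and a second request for an already stored node is answered from the store without parsing again.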
For the above system, this scheme also discloses a method for extracting page information from bidding-related web pages, implemented as follows:
S1: automatically locating, by keywords related to tendering and bid awards, the position nodes in the web page where the relevant information resides;
S2: finding the parent node shared by the keywords from the position nodes obtained;
S3: determining whether the parent node obtained has already been collected; if it has not, determining how the content under the parent node is arranged and then crawling the information according to that arrangement;
S4: storing and displaying the crawled tender information and bid-award information.
In steps S1-S4, the crawling of tender information and bid-award information is carried out by crawling sub-servers that the main server assigns according to the allocation rule. When the allocation rule is generated, the update time of the site information on each enterprise website is obtained first. The enterprise websites are then sorted chronologically by update time and, at the same time, ranked by average daily visitor count. If several enterprise websites update at the same time, the crawl instruction is executed first for the website with the higher average daily visitor count; if the update times differ, crawl instructions are executed in order of update time. Each crawling sub-server is scheduled to run in order of the execution time of its crawl instruction. In step S3, after the tender information and bid-award information have been crawled, a random selection of the crawled information is checked for correctness.
Embodiment Two
Embodiment Two differs from Embodiment One in that the allocation model generation subsystem in Embodiment Two comprises a user terminal, a management terminal, and a main server. The user terminal and the management terminal both communicate with the main server over the network through an existing Wi-Fi module, and each may be an existing mobile phone or computer. The allocation model generation subsystem and the information management subsystem use the same user terminal and the same main server.
I. User terminal
The user terminal includes:
Follow-demand entry module: used by the user to enter the information set he or she wishes to follow and to send it to the main server. The information set comprises the names of the enterprises the user wants to follow and subscribe to and the keywords of the information content of interest.
II. Main server
The main server includes:
Database: stores all data generated and received by the main server, and sets up a user information storage module for each user.
Enterprise website visitor count acquisition module: obtains from each enterprise website its total number of visitors over the past year, calculates the average daily visitor count from that total, and then ranks the enterprises by average daily visitor count to generate an enterprise website average daily visitor list. The list contains each enterprise's name and its average daily visitor count; websites with more average daily visitors are ranked first and those with fewer are ranked last.
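The average daily visitor list can be computed in a few lines. The Python sketch below assumes a 365-day year and uses invented enterprise names and totals; the function name is hypothetical.

```python
def average_daily_visitor_list(yearly_totals, days=365):
    """Turn {enterprise: visitors in the past year} into a ranked list of
    (enterprise, average daily visitors), busiest first."""
    ranked = [(name, total / days) for name, total in yearly_totals.items()]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Made-up sample data.
visitor_list = average_daily_visitor_list({"enterprise-x": 365000,
                                           "enterprise-y": 730000})
```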
Enterprise website visitor count recording module: records the number of visitors to each enterprise website in every hour of every day, then generates for each website a line chart of hourly visitor counts over time for each day. It analyzes, for each day, how the visitor count moves from its peak period to its off-peak period, and then judges whether that peak-to-off-peak pattern is the same for the same website on different dates. If it is, the module generates a daily access time record for the website. If it is not, the module analyzes the pattern for each day from Monday to Sunday, taking the week as the unit, and generates a weekly access time record. The daily access time record contains the peak period and off-peak period of the website's visitor count on a typical day. The weekly access time record contains the daily change in visitor count within a week, the daily peak and off-peak periods, the pattern of change in visitor count from Monday to Sunday, and the pattern of change in the peak and off-peak periods.
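The choice between a daily and a weekly access time record can be sketched as below. This Python sketch simplifies heavily and rests on assumptions: the peak and off-peak periods are each reduced to a single hour, two days count as "consistent" only when those hours match exactly, and the function names and record layout are invented.

```python
def peak_and_trough(hourly):
    """hourly: 24 visitor counts for one day; returns (peak_hour, trough_hour).
    When several hours tie, the earliest one is returned."""
    peak = max(range(24), key=lambda h: hourly[h])
    trough = min(range(24), key=lambda h: hourly[h])
    return peak, trough


def access_time_record(days):
    """days: {weekday name: 24 hourly counts} for one enterprise website."""
    patterns = {day: peak_and_trough(counts) for day, counts in days.items()}
    if len(set(patterns.values())) == 1:
        # Same pattern every day: a single daily record is enough.
        return {"kind": "daily", "pattern": next(iter(patterns.values()))}
    # Otherwise keep one pattern per weekday.
    return {"kind": "weekly", "pattern": patterns}


def sample_day(peak, trough):
    # Made-up data: flat traffic with one busy hour and one quiet hour.
    counts = [10] * 24
    counts[peak] = 100
    counts[trough] = 1
    return counts


record = access_time_record({"Mon": sample_day(9, 3), "Sat": sample_day(14, 5)})
```

A real system would probably compare peak and off-peak periods with some tolerance rather than requiring identical hours.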
Enterprise website announcement time acquisition module: obtains the time at which each enterprise website updates its information each day. It also finds the pages announced on the website through keywords such as "tender" and "bid award", then crawls the bid-award publication time using keywords such as "announcement time" or "publication time", and generates the enterprise information update time information. In this scheme that information comprises the website's daily update time and the bid-award publication time. The module also marks the daily update times of the different enterprise websites on a time axis whose unit is one day, with websites that update at the same time marked at the same point; bid-award publication times are marked on a calendar. The marked daily update times and bid-award publication times are then compiled into a time information record table.
User viewing-pattern acquisition module: obtains each user's viewing-pattern record table. To do so, it first obtains from the login and registration module the times at which each user logs in and views information each day and the times at which the user views each item of content, then generates a viewing-pattern record table for each user. Each table contains: the daily login pattern (the user's habitual login times, comprising the pattern of the first, second, and third daily logins to the system), the content viewed, the time at which each enterprise's content is viewed, and the order in which enterprise content is viewed.
Allocation model generation module: generates the allocation model from the enterprise website average daily visitor list, the daily access time record, the weekly access time record, the time information record table, and the user viewing-pattern record table, and assigns the corresponding crawling sub-servers to execute crawl instructions according to the allocation model.
When the allocation model is generated, all enterprise websites followed by the user are divided into three types according to the enterprise information update time information. In the first type, the website's daily update time falls before the user's first daily login, and the crawl period runs from the daily update time to the user's first daily login. In the second type, the daily update time falls between the user's first and second daily logins, and the crawl period runs from the daily update time to the user's second daily login. In the third type, the daily update time falls between the user's second and third daily logins, and the crawl period runs from the daily update time to the user's third daily login.
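The three-way division can be expressed as a lookup of the first habitual login that follows the update time. The Python sketch below is illustrative only: the function name and sample times are hypothetical, and two cases the source leaves open are settled by assumption (an update exactly at a login time is assigned to the next period, and an update after the third login returns `None`).

```python
from datetime import time


def crawl_window(update, logins):
    """logins: the user's habitual first, second and third daily login times,
    in ascending order. Returns (type, (window_start, window_end)) or None."""
    for kind, login in enumerate(logins, start=1):
        if update < login:
            return kind, (update, login)
    return None  # update falls after the last login; not covered by the source


# Made-up sample data.
logins = [time(9, 0), time(13, 0), time(18, 0)]
window = crawl_window(time(10, 30), logins)
```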
For enterprise websites of the same type, the user's daily login times are compared, in the order in which the user views enterprise content, with the daily or weekly access time records of the enterprise websites that the user follows and often browses (the enterprise names entered in the user's information set and the enterprise websites corresponding to the viewed content recorded in the user viewing-pattern record table). The off-peak period of visitor count that falls between the website's daily update time (including the daily update time on the day bid-award information is published) and the user's daily login time is identified; this period is called the best crawl time. The corresponding crawling sub-server is scheduled to crawl the tender information and bid-award information within the best crawl time.
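Reduced to whole hours, the best crawl time is the quietest hour between the site's update and the user's login. The Python sketch below shows that selection; working in whole hours, the function name, and the sample visitor counts are simplifying assumptions.

```python
def best_crawl_hour(hourly_visitors, update_hour, login_hour):
    """Return the hour with the fewest visitors in [update_hour, login_hour),
    or None when the update does not precede the login."""
    window = range(update_hour, login_hour)
    if not window:
        return None
    return min(window, key=lambda h: hourly_visitors[h])


# Made-up sample data: a site that updates at 9:00, is busiest from 9:00
# to 10:00, and is followed by a user who logs in at 11:00.
counts = [0] * 24
counts[9] = 500
counts[10] = 200
hour = best_crawl_hour(counts, 9, 11)
```

Here the crawl is placed in the 10:00 hour, after the rush that follows the update but before the user logs in.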
If the best crawl times of several enterprise websites followed by the user are the same, and the average daily visitor counts of those websites in the average daily visitor list are also the same, idle crawling sub-servers are scheduled to crawl them in the order in which the user views enterprise content. If the best crawl times are the same but the average daily visitor counts differ, two orderings are used: the ranking in the average daily visitor list (websites with more visitors first) and the user's viewing order (enterprises the user views first come first). The website ranked first in the user's viewing order is crawled first, then the website ranked first in the average daily visitor list, then the website ranked second in the user's viewing order, then the website ranked second in the average daily visitor list, and so on. Because the same website can appear in both orderings, once a website has been crawled its position in both orderings lapses and it is not considered again. For example, suppose the best crawl times of four enterprise websites A, B, C, and D followed by the same user fall in the same period, their order in the average daily visitor list is A-B-C-D, and their order in the user viewing-pattern record table is C-A-D-B. The website of enterprise C is crawled first, then that of enterprise A, then that of enterprise D, and finally that of enterprise B.
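The alternating selection in this example can be traced with a short Python sketch. The function name is hypothetical; the user's viewing order is taken here as C, A, D, B and the visitor ranking as A, B, C, D, so that both orderings cover the same four sites.

```python
def interleave_crawl_order(user_order, visitor_order):
    """Alternate between the user's viewing order and the visitor ranking,
    starting with the user's order and skipping sites already scheduled."""
    pending = set(user_order) | set(visitor_order)
    sources = [user_order, visitor_order]
    result = []
    turn = 0
    while pending:
        source = sources[turn % 2]
        # First site in this ordering that has not been scheduled yet.
        pick = next((site for site in source if site in pending), None)
        if pick is not None:
            result.append(pick)
            pending.discard(pick)
        turn += 1
    return result


schedule = interleave_crawl_order(["C", "A", "D", "B"], ["A", "B", "C", "D"])
print(schedule)
```

The trace is: C from the user's order, A from the visitor ranking, D from the user's order (C and A are already scheduled), and finally B from the visitor ranking.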
In addition, a user's first login time on a given day may differ from the time recorded in the user information-viewing pattern table. For example, a user who usually first logs into the system fairly late in the day may one day urgently want to see the bidding information or bid-winning information published by an enterprise, and so logs in much earlier than usual. Under the rules of step one and step two, even if the enterprise website updated its information before this login, the information may not have been crawled yet: the visitor trough determined in step two lies between the enterprise's update time and the user's usual first login time, but not between the update time and that day's first login time. The bidding information or bid-winning information published on the enterprise websites that the user follows and frequently browses would therefore not yet be available. To handle this, whenever the user's first login on a given day is earlier than the usual first login time, the system computes the time difference between the user's usual first login time and the time at which the user usually views the bidding information or bid-winning information crawled from the first enterprise website. It also records the user's first login time on that day and, starting from that time, assigns information scratching child servers to crawl the corresponding bidding information or bid-winning information from the corresponding enterprise websites. That is, the period equal in length to the time difference, starting at that day's first login, is the time allotted for obtaining the bidding information and bid-winning information.
During this crawl, the crawling times for the enterprise websites are ordered according to the sequence in which the user views enterprise content, as recorded in the user's information-viewing record table; the websites whose information the user usually views first are also the first to be assigned information scratching child servers in this crawl.
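The early-login rule above can be sketched as a small time calculation. This is a minimal sketch under stated assumptions: the function name `catch_up_window` and the sample times (usual login at 11:00, usual first view at 11:10, early login at 07:30) are hypothetical and chosen only for illustration.

```python
from datetime import datetime, time


def catch_up_window(usual_login, usual_first_view, login_today):
    """Return (start, end) of the catch-up crawl window for an early login.

    usual_login and usual_first_view are times of day taken from the
    user's viewing record; login_today is the actual first login.
    Returns None when the login is not earlier than usual, in which
    case the regular schedule applies.
    """
    day = login_today.date()
    usual_login_dt = datetime.combine(day, usual_login)
    if login_today >= usual_login_dt:
        return None
    # Time difference between the usual login and the usual first view.
    gap = datetime.combine(day, usual_first_view) - usual_login_dt
    # The crawl must finish within that gap, counted from today's login.
    return login_today, login_today + gap


window = catch_up_window(time(11, 0), time(11, 10), datetime(2018, 12, 5, 7, 30))
```

Within the returned window, websites would then be crawled in the user's habitual viewing order.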
Three, information scratching child servers
Information scratching child servers include:
Information crawling module: receives the information crawling instruction sent by the director server and, on receiving it, crawls the bidding information or bid-winning information from the corresponding enterprise website.
In addition, for the distribution model generation subsystem described above, this embodiment also provides a distribution model generation method, illustrated here by example. Assume the information set entered by the user indicates that the user wants to follow the bidding information of Company A and Company B. Company A's enterprise website updates its information at 9 a.m. every day; its visitor count peaks in the hour from 9 a.m. to 10 a.m. and declines gradually afterwards. Company B's enterprise website updates its information at 8 a.m. every day, but has few visitors between 8 a.m. and 9 a.m., more visitors between 9 a.m. and 11 a.m., and a steady visitor count for the rest of the day. Company A's website has more average daily visits than Company B's. The user habitually checks for bidding updates at 11 a.m. each day and not again afterwards, each time viewing Company A's bidding information first and Company B's second.
The specific implementation steps are as follows:
S1: The user fills in the information set, which lists Company A and Company B as the companies to follow.
S2: Based on the information set, the director server obtains the average daily visitor information list, the daily access-time record information and the company information update-time information for the enterprise websites of Company A and Company B. In the average daily visitor information list, Company A ranks ahead of Company B. The daily access-time record information obtained from Company A's website shows that the visitor count peaks between 9 a.m. and 10 a.m. each day and declines gradually afterwards; the record obtained from Company B's website shows few visitors between 8 a.m. and 9 a.m., more visitors between 9 a.m. and 11 a.m., and a steady visitor count for the rest of the day.
S3: Each day, when the enterprise websites of Company A and Company B update their information, the director server crawls the bidding-related content at the same time. In this way, whatever period the user chooses for checking, as long as the enterprise website has updated its information and the crawl has succeeded, the user can view the crawled information.
S4: The director server records the times at which the user logs into the system and views information each day, and generates the user information-viewing pattern table. The table records that the user's first login each day is at 11 a.m.
S5: The director server generates the distribution model from the average daily visitor information list, the daily access-time record information, the time information record table and the user information-viewing pattern table. When generating the model, it first determines which of the three types each of the two enterprise websites belongs to. Because both websites update their information before the user's first daily login, both are classified as the first type. It then determines the best information-crawling time for each enterprise: between 10 a.m. and 11 a.m. for Company A, and between 8 a.m. and 9 a.m. for Company B. Since the two best crawling times differ, information scratching child servers are assigned separately to crawl each enterprise's website within its own best crawling time range.
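Steps S1 to S5 can be illustrated with a short sketch that picks, for each website, the least-visited hour between its update time and the user's first login. This is a sketch under stated assumptions: the function name `best_crawl_hour` and the hourly visitor counts are hypothetical, chosen only to match the qualitative description (Company A busiest from 9 to 10 and declining afterwards; Company B quiet from 8 to 9 and busier from 9 to 11). Only the first website type is handled, since the other types are not described in this passage.

```python
def best_crawl_hour(hourly_visitors, update_hour, login_hour):
    """Pick the start of the least-visited hour in [update_hour, login_hour).

    Applies only to the first type of website, whose daily update
    precedes the user's first login; returns None otherwise.
    """
    if update_hour >= login_hour:
        return None
    hours = range(update_hour, login_hour)
    # min() keeps the earliest hour when several hours tie.
    return min(hours, key=lambda h: hourly_visitors[h])


# Hypothetical hourly visitor counts, keyed by the hour the slot starts.
company_a = {9: 500, 10: 200}
company_b = {8: 20, 9: 300, 10: 300}
user_first_login = 11  # 11 a.m., from the viewing pattern table

plan = {
    "Company A": best_crawl_hour(company_a, 9, user_first_login),
    "Company B": best_crawl_hour(company_b, 8, user_first_login),
}
# The two best hours differ, so each website gets its own child server slot.
```

With these sample counts the plan assigns Company A the 10 a.m. slot and Company B the 8 a.m. slot, matching the windows given in S5.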
Embodiment three
Embodiment three differs from embodiment two in that the director server further includes:
Information management module: matches the enterprise names marked for following in each user's information set with the enterprise names corresponding to the sources of the crawled information the user has accessed (the enterprise name recorded on the enterprise website), then counts how many registered users have followed, subscribed to or consulted the information crawled from each enterprise website, and generates a user followed-information record table.
Crawl adjustment module: obtains the first daily login times recorded in all users' information-viewing pattern tables and sorts them chronologically to generate a user login-time arrangement table. Using this table and the user followed-information record table, it determines which user's first daily login time is closest to, and later than, the enterprise website's daily update time; such a user is called the enterprise's closest user. Crawling of that enterprise website is then scheduled by running the distribution model on the information recorded in the closest user's information-viewing pattern table. Once the website's bidding information or bid-winning information has been obtained, the same information is not crawled from that website again; that is, the bidding information or bid-winning information published by the enterprise is crawled between the website's daily update time and the closest user's first login time.
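Selecting the closest user for one enterprise website can be sketched as follows. This is only an illustrative sketch: the function name `closest_user`, the user identifiers and the sample times are hypothetical, and the data structures stand in for the login-time arrangement table and the followed-information record table.

```python
from datetime import time


def closest_user(update_time, first_logins, followers):
    """Return the follower whose first daily login is soonest after the update.

    first_logins maps user id -> usual first login time of day.
    followers lists the users who follow this enterprise website.
    Users who log in at or before the update time are not eligible.
    """
    eligible = [
        (first_logins[u], u)
        for u in followers
        if u in first_logins and first_logins[u] > update_time
    ]
    if not eligible:
        return None
    # The earliest login after the update is the closest one.
    return min(eligible)[1]


logins = {"u1": time(11, 0), "u2": time(9, 30), "u3": time(8, 0)}
chosen = closest_user(time(9, 0), logins, ["u1", "u2", "u3"])
```

The website is then crawled once, between its update time and the chosen user's first login, and the result serves every other follower as well.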
Compared with embodiment two, embodiment three avoids repeating the crawl of the same enterprise website's bidding information and bid-winning information for each user when different users follow the same enterprise website.
What has been described above is only an embodiment of the present invention; well-known specific structures and characteristics in the scheme are not described in detail here. A person of ordinary skill in the art knows all the common technical knowledge of the technical field to which the invention belongs before the filing date or priority date, can learn all the prior art in that field, and is able to apply the routine experimental means available before that date. Under the teaching of this application, such a person can refine and implement this scheme in combination with their own abilities, and some typical known structures or known methods should not be an obstacle to implementing this application. It should be pointed out that a person skilled in the art may also make several modifications and improvements without departing from the structure of the invention; these shall also be regarded as falling within the protection scope of the invention and will not affect the effect of implementing the invention or the practicability of the patent. The protection scope claimed by this application shall be determined by the content of the claims, and the specific embodiments and other records in the description may be used to interpret the content of the claims.

Claims (6)

1. A method for extracting page information from bidding-related web pages, characterized by comprising the following steps:
S1: automatically obtaining the location node web pages where the relevant information is found, according to keywords related to tendering and bid winning in the web pages;
S2: finding the parent node web page shared by the keywords according to the obtained location node web pages;
S3: judging whether the obtained parent node web page has already been collected and, if it has not, determining the content arrangement of the parent node web page and crawling information according to that arrangement;
S4: storing and displaying the crawled bidding information and bid-winning information.
2. The method for extracting page information from bidding-related web pages according to claim 1, characterized in that: in steps S1 to S4, the bidding information and bid-winning information are captured by information scratching servers assigned by the director server according to an allocation criterion.
3. The method for extracting page information from bidding-related web pages according to claim 2, characterized in that: when the allocation criterion is generated, the update times of the website information on the enterprise websites are obtained first; the enterprise websites are then sorted chronologically by information update time and, at the same time, sorted by average daily visitor count; if several enterprise websites have the same information update time, the information-capture instruction is executed first for the website with the higher average daily visitor count; if the update times differ, the information-capture instructions are executed in order of the websites' update times; and the information scratching child servers are scheduled in turn according to the execution times of the information-capture instructions.
4. The method for extracting page information from bidding-related web pages according to claim 1, characterized in that: in step S3, after the bidding information and bid-winning information are crawled, crawled bidding information and bid-winning information are randomly sampled and checked for correctness.
5. A system for extracting page information from bidding-related web pages, characterized by comprising: a user terminal, a director server and a plurality of information scratching child servers;
the user terminal is used for user registration, login, and following and subscribing to bidding information and bid-winning information;
the director server is used to generate the allocation criterion for the information scratching child servers, and then assigns information scratching child servers according to that criterion to capture the bidding information and bid-winning information of the corresponding enterprise websites.
6. The system for extracting page information from bidding-related web pages according to claim 5, characterized in that: the director server includes a user classification and restriction module, which divides registered users by permission into ordinary users, document staff users and system tester users; an ordinary user can read the accessed information after purchasing a membership; a document staff user can both read and write the accessed information; and a system tester user can read and write information and carry out software testing.
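As an illustration of steps S1 and S2 of claim 1, finding the parent node shared by the keyword-bearing nodes can be read as a lowest-common-ancestor search over the page's element tree. The following is a minimal sketch under stated assumptions, not the patented implementation: it uses Python's standard `xml.etree.ElementTree` on a small well-formed sample page, and the function name `shared_parent`, the keywords and the sample markup are all hypothetical.

```python
import xml.etree.ElementTree as ET


def shared_parent(root, keywords):
    """Return the deepest element that contains every keyword-bearing node."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    hits = [
        el for el in root.iter()
        if el.text and any(k in el.text for k in keywords)
    ]
    if not hits:
        return None

    def path_from_root(el):
        chain = [el]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain[::-1]

    common = None
    # Walk all root-to-node paths in parallel until they diverge.
    for level in zip(*(path_from_root(el) for el in hits)):
        if all(node is level[0] for node in level):
            common = level[0]
        else:
            break
    return common


page = ET.fromstring(
    "<html><body><div id='nav'><a>Home</a></div>"
    "<ul id='notices'><li><a>Tender notice: road works</a></li>"
    "<li><a>Bid award: school building</a></li></ul></body></html>"
)
node = shared_parent(page, ("Tender", "Bid award"))
```

Here the two keyword links sit in separate list items, so the shared parent is the `ul` element holding the notice list. Real pages are rarely well-formed XML, so a production crawler would need a tolerant HTML parser instead.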
CN201811481859.3A 2018-12-05 2018-12-05 Method and system for extracting page information of bid-inviting and bidding related webpage Active CN109597927B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811481859.3A CN109597927B (en) 2018-12-05 2018-12-05 Method and system for extracting page information of bid-inviting and bidding related webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811481859.3A CN109597927B (en) 2018-12-05 2018-12-05 Method and system for extracting page information of bid-inviting and bidding related webpage

Publications (2)

Publication Number Publication Date
CN109597927A true CN109597927A (en) 2019-04-09
CN109597927B CN109597927B (en) 2022-11-18

Family

ID=65961182

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811481859.3A Active CN109597927B (en) 2018-12-05 2018-12-05 Method and system for extracting page information of bid-inviting and bidding related webpage

Country Status (1)

Country Link
CN (1) CN109597927B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020005534A (en) * 2001-11-07 2002-01-17 주식회사 한성정보통신 Tender information management system for electronic tender and tender service providing method using the system
US20100250516A1 (en) * 2009-03-28 2010-09-30 Microsoft Corporation Method and apparatus for web crawling
CN102347930A (en) * 2010-07-26 2012-02-08 中国电信股份有限公司 Method and system for obtaining webpage content
CN103617225A (en) * 2013-11-25 2014-03-05 北京奇虎科技有限公司 Associated webpage searching method and system
CN105069112A (en) * 2015-08-11 2015-11-18 浪潮软件集团有限公司 Industry vertical search engine system
CN105117501A (en) * 2015-10-09 2015-12-02 广州神马移动信息科技有限公司 Web crawler scheduling method and web crawler system applying same
CN105912552A (en) * 2015-12-23 2016-08-31 乐视网信息技术(北京)股份有限公司 Method for capturing webpage video and terminal device for capturing webpage video
CN106960063A (en) * 2017-04-20 2017-07-18 广州优亚信息技术有限公司 A kind of internet information crawl and commending system for field of inviting outside investment
CN108563679A (en) * 2018-03-06 2018-09-21 广西友信矿业有限公司 Quarrying Information Acquisition System based on information collection and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
T. MATSUDA ETC.: "An efficient Internet crawling and filtering system for the nationwide tendering information retrieval", 《PROCEEDINGS IEEE/WIC INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE (WI 2003)》 *
冯思平: "Web招标信息搜索及管理***的设计", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *
吴敏丽: "基于主题搜索引擎的文本聚类分类研究与实现", 《中国优秀硕士学位论文全文数据库(信息科技辑)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110417873A (en) * 2019-07-08 2019-11-05 上海鸿翼软件技术股份有限公司 A kind of network information extraction system for realizing record webpage interactive operation
CN110417873B (en) * 2019-07-08 2021-04-02 上海鸿翼软件技术股份有限公司 Network information extraction system for realizing recording webpage interactive operation
CN110502680A (en) * 2019-08-27 2019-11-26 重庆大司空信息科技有限公司 A kind of abstracting method and device of acceptance of the bid bulletin relevant field

Also Published As

Publication number Publication date
CN109597927B (en) 2022-11-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant