CN113704667A - Automatic extraction processing method and device for bidding announcement - Google Patents

Automatic extraction processing method and device for bidding announcement

Info

Publication number
CN113704667A
Authority
CN
China
Prior art keywords
webpage
label
text content
text
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111017828.4A
Other languages
Chinese (zh)
Other versions
CN113704667B (en)
Inventor
姚从磊
陈浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Bailian Intelligent Technology Co ltd
Original Assignee
Beijing Bailian Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Bailian Intelligent Technology Co ltd
Priority to CN202111017828.4A
Publication of CN113704667A
Application granted
Publication of CN113704667B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0611 Request for offers or quotes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a method and a device for automatically extracting and processing a bid announcement. The method comprises the following steps: capturing a webpage according to a webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is the content displayed in the webpage, is extracted according to the corresponding tags of the HTML (hypertext markup language) used by the webpage, and is obtained by splicing the texts corresponding to the tags in the order in which the tags appear in the webpage source code; determining from the text content whether the text content is a bidding announcement; and acquiring keywords from the text content and determining the corresponding target object in the bid announcement according to the keywords, wherein the keywords indicate the target object and are configured in advance. The method and the device solve the problem that bidding announcements have to be acquired manually in the prior art, thereby improving the efficiency of acquiring bidding announcements and preventing announcements from being missed.

Description

Automatic extraction processing method and device for bidding announcement
Technical Field
The application relates to the field of webpage text processing, in particular to an automatic extraction processing method and device for a bidding announcement.
Background
Bid announcements are generally issued at the various stages of a bidding activity by the government agencies, enterprises, public institutions, intermediary agencies and other parties taking part in it, either on the issuer's own website or on a dedicated third-party media website, and disclose the key events and important information of the bidding activity. Tens of thousands of websites across the country publish bidding announcements. The average total number of announcements released per day is over 10 million.
The target object is the commodity or service that the buyer or tenderer needs to purchase in a bidding activity, and is the information of high commercial value in a bidding announcement. Target objects are highly varied and are expressed in many ways, often using specialized terms from specific fields. Quickly and accurately finding the specific target objects a user cares about among the numerous and varied target objects in massive numbers of bidding announcements is therefore valuable.
At present, bidding announcements are obtained manually. This consumes a large amount of human resources and easily leads to announcements being missed.
Disclosure of Invention
The embodiment of the application provides an automatic extraction processing method and device of a bidding announcement, and aims to at least solve the problem caused by the need of manually acquiring the bidding announcement in the prior art.
According to an aspect of the present application, there is provided an automatic extraction processing method of a bid announcement, including: capturing a webpage according to the webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; determining the text content as a bidding announcement from the text content; and acquiring keywords from the text content and determining a corresponding target object in the bid announcement according to the keywords, wherein the keywords are used for indicating the target object, and the keywords are configured in advance.
Further, crawling the web page according to the web page address comprises: receiving at least one address of a website or webpage configured by a user; and grabbing the webpage according to the at least one address according to a preset period.
Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring first tag content used for indicating text content in the HTML language; extracting the text in the first tag content; acquiring second tag content for indicating the format of the text; and setting the format of the text according to the second tag content to obtain the text content.
Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring a third tag used for indicating a table in the webpage; extracting the contents of the rows and columns indicated in the third tag, and arranging the contents by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.
Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring a fourth tag used for indicating that a file is embedded in the webpage; and acquiring the type of the file from the fourth tag, and calling a corresponding text extraction tool according to the type of the file to extract the text content from the file.
According to another aspect of the present application, there is also provided an automatic extraction processing apparatus of a bid notice, including: the grabbing module is used for grabbing the webpage according to the webpage address; the extraction module is used for extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; the first determining module is used for determining the text content as a bidding announcement from the text content; a second determining module, configured to obtain a keyword from the text content and determine a corresponding target object in the bid announcement according to the keyword, where the keyword is used to indicate the target object, and the keyword is preconfigured.
Further, the grasping module is configured to: receiving at least one address of a website or webpage configured by a user; and grabbing the webpage according to the at least one address according to a preset period.
Further, the extraction module is configured to: acquire first tag content used for indicating text content in the HTML language; extract the text in the first tag content; acquire second tag content for indicating the format of the text; and set the format of the text according to the second tag content to obtain the text content.
Further, the extraction module is configured to: acquire a third tag used for indicating a table in the webpage; extract the contents of the rows and columns indicated in the third tag, and arrange the contents by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.
Further, the extraction module is configured to: acquire a fourth tag used for indicating that a file is embedded in the webpage; and acquire the type of the file from the fourth tag, and call a corresponding text extraction tool according to the type of the file to extract the text content from the file.
In the embodiment of the application, the method comprises the steps of grabbing a webpage according to a webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; determining the text content as a bidding announcement from the text content; and acquiring keywords from the text content and determining a corresponding target object in the bid announcement according to the keywords, wherein the keywords are used for indicating the target object, and the keywords are configured in advance. Through the method and the device, the problem that the bidding announcement needs to be acquired manually in the prior art is solved, so that the acquisition efficiency of the bidding announcement is improved, and the bidding announcement is prevented from being missed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
Fig. 1 is a flowchart of an automatic extraction processing method of a bid announcement according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
In the present embodiment, an automatic extraction processing method of a bidding announcement is provided, and fig. 1 is a flowchart of an automatic extraction processing method of a bidding announcement according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:
Step S102, capturing a webpage according to a webpage address;
In this step, at least one address of a website or web page configured by a user may be received, and the webpage is then captured from the at least one address at a preset period.
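As a minimal sketch of step S102, the following class captures each configured address at most once per preset period. The class name `PeriodicCrawler`, the injected `fetch` callable and the injectable `clock` are hypothetical and are not part of the application; a real `fetch` would issue an HTTP request.

```python
import time
from typing import Callable, Dict, Iterable


class PeriodicCrawler:
    """Capture each configured address at most once per period (illustrative only)."""

    def __init__(self, addresses: Iterable[str], period_seconds: float,
                 fetch: Callable[[str], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.addresses = list(addresses)   # addresses configured by the user
        self.period_seconds = period_seconds
        self.fetch = fetch                 # injected so the sketch needs no network access
        self.clock = clock
        self._last_run: Dict[str, float] = {}

    def run_due(self) -> Dict[str, str]:
        """Fetch every address whose period has elapsed; return {address: page source}."""
        now = self.clock()
        pages: Dict[str, str] = {}
        for address in self.addresses:
            last = self._last_run.get(address)
            if last is None or now - last >= self.period_seconds:
                pages[address] = self.fetch(address)
                self._last_run[address] = now
        return pages
```

Calling `run_due()` from a timer or scheduler loop yields only the pages that are due, so addresses with different first-seen times stay on their own schedule.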
Step S104, extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code;
Step S106, determining from the text content whether the text content is a bidding announcement;
This step may also be implemented by machine learning. A third machine learning model may be trained using multiple sets of third training data, where each set of third training data includes input data and output data: the input data is a text content, and the output data is a label identifying whether that text content is a bidding announcement. After training, the text content is input into the third machine learning model, which outputs whether the text content is a bidding announcement; if so, step S108 is executed.
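The application does not say which model type is used for this classification. Purely as an illustration, the sketch below trains a small multinomial naive Bayes classifier on (text, label) pairs; naive Bayes is a swapped-in technique chosen for brevity, and the whitespace tokenizer is an assumption that would need to be replaced by a word segmenter for Chinese text.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class NaiveBayesAnnouncementClassifier:
    """Binary text classifier: True = bidding announcement (illustrative only)."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    @staticmethod
    def tokenize(text: str) -> List[str]:
        return text.lower().split()  # assumption: whitespace-separated tokens

    def fit(self, samples: Iterable[Tuple[str, bool]]) -> "NaiveBayesAnnouncementClassifier":
        # Training data must contain at least one example of each label.
        for text, label in samples:
            tokens = self.tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens: List[str], label: bool) -> float:
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)  # class prior
        total_words = sum(self.word_counts[label].values())
        for token in tokens:
            # Add-one smoothing so unseen words do not zero out the score.
            count = self.word_counts[label][token] + 1
            score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, text: str) -> bool:
        tokens = self.tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)
```

Only texts for which `predict` returns `True` would be passed on to step S108.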
Step S108, obtaining keywords from the text content and determining corresponding target objects in the bid announcement according to the keywords, wherein the keywords are used for indicating the target objects, and the keywords are configured in advance.
This step may also be implemented by machine learning. A second machine learning model may be trained using multiple sets of second training data, where each set of second training data includes input data and output data: the input data is a text content, and the output data is a target object. After training, the text content is input into the second machine learning model, which outputs the target object.
In another alternative, keywords from the context surrounding each output target object are acquired and saved. The saved keywords are then searched for in the text content, that is, all saved context keywords serve as the keywords to be acquired in step S108, and the target object can then be found in the bid announcement according to the acquired keywords.
The target object output by the second machine learning model and the target object found according to the keywords can be compared. If the two are consistent, the target object has been found successfully; if they are inconsistent, the target object found is displayed to the user.
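The keyword lookup and the comparison can be sketched as follows. The function names, the rule that the target is the text following a keyword up to the end of the clause, and the whitespace and case normalization used for the comparison are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, Optional


def find_target_by_keywords(text: str, keywords: Iterable[str]) -> Optional[str]:
    """Return the text that follows the first configured keyword found, if any."""
    for keyword in keywords:
        # Assumption: the target follows the keyword, optionally after a colon,
        # and ends at a line break, full stop or semicolon.
        match = re.search(re.escape(keyword) + r"\s*[:：]?\s*([^\n。;；]+)", text)
        if match:
            return match.group(1).strip()
    return None


def reconcile(model_target: Optional[str], keyword_target: Optional[str]) -> Dict[str, object]:
    """Compare the model output with the keyword result and flag disagreements."""
    def norm(value: Optional[str]) -> str:
        return "".join(value.split()).lower() if value else ""

    consistent = bool(norm(model_target)) and norm(model_target) == norm(keyword_target)
    return {
        "target": model_target if consistent else None,
        "needs_review": not consistent,  # shown to the user when the two disagree
        "candidates": [c for c in (model_target, keyword_target) if c],
    }
```

Returning both candidates on disagreement leaves the choice of which one to display to the caller, since the text does not fix that detail.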
Through the steps, the problem caused by the fact that the bidding announcement needs to be acquired manually in the prior art is solved, so that the acquisition efficiency of the bidding announcement is improved, and the bidding announcement is prevented from being missed.
The text content can be extracted according to the type of tag. For example, first tag content used for indicating text content in the HTML language can be obtained; the text in the first tag content is extracted; second tag content for indicating the format of the text is obtained; and the format of the text is set according to the second tag content to obtain the text content.
For another example, a third tag for indicating a table in the web page is obtained; the contents of the rows and columns indicated in the third tag are extracted and arranged by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.
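A minimal sketch of the row-and-column extraction, using Python's standard `html.parser`. The class and function names and the `" | "` cell separator (standing in for the pre-configured format) are assumptions; nested tables and `rowspan`/`colspan` expansion are deliberately left out.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableTextExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:            # tolerate a cell that appears without <tr>
                self.rows.append([])
            # Collapse internal whitespace so each cell becomes one clean string.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def table_to_text(html: str, cell_sep: str = " | ", row_sep: str = "\n") -> str:
    """Arrange table cells by row and column into the configured text format."""
    parser = TableTextExtractor()
    parser.feed(html)
    parser.close()
    return row_sep.join(cell_sep.join(row) for row in parser.rows if row)
```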
In an alternative embodiment, the text content may be extracted from the table using a first machine learning model trained using a plurality of sets of training data, each set of training data including input data and output data, the input data being the source code of an HTML web page including the table, and the output data being the text content obtained by arranging the text in the table in a predetermined format. After training, the web page in HTML format is input into the first machine learning model, and the text content is output from the first machine learning model.
For another example, a fourth tag used for indicating that a file is embedded in the webpage is acquired; the type of the file is acquired from the fourth tag, and a corresponding text extraction tool is called according to the type of the file to extract the text content from the file.
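Selecting a text extraction tool by file type amounts to a dispatch table. In the sketch below the file type is inferred from the extension of the embedded file's address, which is an assumption; the registry and function names are hypothetical, and only a plain-text extractor is provided. Extractors for PDF, images or other formats would be registered by the caller.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict
from urllib.parse import urlparse

Extractor = Callable[[bytes], str]


def extract_plain_text(data: bytes) -> str:
    """Trivial extractor for .txt files."""
    return data.decode("utf-8", errors="replace")


def make_registry() -> Dict[str, Extractor]:
    """Map file extensions to extraction tools; callers add their own entries."""
    return {".txt": extract_plain_text}


def file_type(address: str) -> str:
    """Lower-cased extension of the embedded file's address, e.g. '.pdf'."""
    return PurePosixPath(urlparse(address).path).suffix.lower()


def extract_embedded_file(address: str, data: bytes,
                          registry: Dict[str, Extractor]) -> str:
    """Call the extraction tool registered for the file's type."""
    kind = file_type(address)
    extractor = registry.get(kind)
    if extractor is None:
        raise ValueError(f"no text extraction tool registered for {kind!r}")
    return extractor(data)
```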
This is described below in connection with an alternative embodiment.
In this optional embodiment, bidding announcements are published mainly as HTML webpages, while a large number of websites also publish them as PDF files, pictures, Flash and the like. Announcements typically do not have a fixed format. Differences among announcement writers in wording, computer proficiency and the like, together with the differences in technology and tools among publishing platforms, lead to great differences in how bidding announcements are worded, in the order and logic in which information is presented, in the rigor of the data, in the degree of standardization of tables, in the display format and so on. The present embodiment provides a solution.
The present embodiment can be described from two aspects. The first aspect concerns obtaining the text: to extract the target object from a bidding announcement, the valid announcement text must first be obtained from announcement files with different layout formats. Existing methods for extracting plain text from an HTML webpage mainly either delete all HTML tags based on rules or call a library to extract the text directly.
The process of this embodiment from the original bid post to plain text is as follows:
When valid text is obtained from HTML, directly deleting the HTML tags results in an almost complete loss of paragraph information. The conversion function provided by the present embodiment preserves the organization of the text relatively correctly. In addition, tables are commonly used in announcements to show list information, and at present there is no good, general method for parsing table data in this scenario. This can leave the information misaligned, making extraction of the target object difficult and error-prone. The invention greatly improves the extraction of correctly expressed text information.
In this embodiment, the conversion of a bidding announcement from HTML into text uses a parser specially designed for bidding announcements. Its main logic is to parse the DOM tree from top to bottom, with special processing logic defined for particular tags. Tags that do not affect line logic, such as <i> and <b>, are deleted. A <br> tag is converted into a line feed, or a line feed plus an empty line, depending on the preceding and following tags and the parent node tag. <table> tags in particular receive special processing: a deep learning method is used to obtain the header keyword information in the table, which is then converted, according to the table structure, into text that is easy for a computer to process.
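The tag-specific handling can be sketched with Python's standard `html.parser`. This is an event-based simplification rather than a DOM-tree parser, and it is not the parser of the application: inline tags are simply ignored, every `<br>` becomes a single line feed (the context-dependent choice between a line feed and an empty line is not reproduced), block-level tags become paragraph breaks, and tables are not treated specially. The tag sets and names are assumptions.

```python
import re
from html.parser import HTMLParser

# Assumption: tags treated as paragraph boundaries; all other tags
# (for example <i>, <b>, <span>) are dropped without affecting line logic.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "li", "tr"}
SKIPPED_TAGS = {"script", "style", "noscript"}


class AnnouncementTextParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a tag whose content is not displayed

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Splice displayed text in source order, keeping paragraph breaks."""
    parser = AnnouncementTextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces
    text = re.sub(r" ?\n ?", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one empty line in a row
    return text.strip()
```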
In this embodiment, plain text is extracted from HTML as in the following example. First, the HTML text is cleaned, roughly as follows:
deleting special characters such as <200d> and \u200d;
replacing accented Latin characters with similar unaccented characters: normalize('NFKD', text);
deleting emoji characters;
converting traditional Chinese characters into simplified Chinese characters;
removing non-essential tags from the HTML, such as <script/>, <noscript/>, <style/>, <block/>, …;
repairing malformed <table/> tags, mainly correcting missing or redundant <tr/>, <th/> and <td/> tags and incorrect colspan and rowspan values.
Then, the elements in the HTML are parsed in sequence by pre-order traversal, with a custom processor for each specific HTML tag, for example:
the Imageprocessor for <image/> extracts the characters in a picture using OCR;
the UlProcessor for <ul/> processes the list as a whole;
the BrProcessor for <br/> uses the surrounding tags to decide whether an additional empty line is needed to mark the break between paragraphs;
the TableProcessor for <table/> handles table tags specially so that the converted text is as suitable as possible for computation and processing.
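Part of the cleaning stage can be sketched with the Python standard library. Note that `unicodedata.normalize('NFKD', ...)` only decomposes characters, so the sketch additionally drops the combining marks to obtain the unaccented form. Emoji removal, traditional-to-simplified conversion and table repair are omitted because they need mapping data or heavier tooling; the function names and the exact set of zero-width characters are assumptions.

```python
import re
import unicodedata

# Zero-width characters such as U+200D that carry no visible content.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")

# Non-essential elements removed together with their content.
NON_CONTENT = re.compile(
    r"<(script|noscript|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop the combining marks left behind."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_html(html: str) -> str:
    """Apply the cleaning steps that need no external data."""
    html = ZERO_WIDTH.sub("", html)
    html = strip_accents(html)
    html = NON_CONTENT.sub("", html)
    return html
```

NFKD also folds compatibility characters, so full-width punctuation such as the full-width colon becomes its ASCII counterpart; whether that is desirable depends on the later processing steps.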
The TableProcessor in this embodiment may perform the following steps. First, the tag is parsed and converted into a value object Table, whose main structure is a two-dimensional array corresponding to the cells of the table; each array element holds the text of the cell, the colspan and rowspan information of the cell, and whether it is a merged cell (horizontally or vertically). A deep learning model is then built to determine which cells are headers. The processing is then designed according to the layout structure of the table itself. Let K denote a header and V denote a non-header, i.e. the value corresponding to a header. Typical layouts are as follows:
KVKV: this layout covers two cases. In the first, no special treatment is needed and the cells are simply joined in order with suitable separators, such as K: V, K: V. In the second, the table is divided into a left part and a right part that describe similar attributes of two different organizations; one half is processed first and then the other half, again by K: V splicing.
The first row is all K and any number of following rows are all V: starting from the second row, each row is organized as K1: V1, K2: V2, ….
The first row and the first column are all K and the remaining cells are V: the K heading each row is used as the row key and the K heading each column as the field key, and each row is spliced as row K, column K1: V1, column K2: V2, ….
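The three layouts can be sketched as follows, assuming the table has already been parsed into a rectangular grid of cell texts and that a header mask (which the embodiment obtains from a deep learning model) is supplied by the caller. All function names are hypothetical, and the left/right-halves variant of the KVKV layout is not covered.

```python
from typing import List

Grid = List[List[str]]
Mask = List[List[bool]]  # True where a cell is a header (K)


def render_kv_pairs(grid: Grid) -> str:
    """KVKV layout: header and value cells alternate within each row."""
    lines = []
    for row in grid:
        pairs = [f"{row[i]}: {row[i + 1]}" for i in range(0, len(row) - 1, 2)]
        lines.append(", ".join(pairs))
    return "\n".join(lines)


def render_header_row(grid: Grid) -> str:
    """First row is all K; every later row becomes K1: V1, K2: V2, ..."""
    header, *body = grid
    return "\n".join(
        ", ".join(f"{k}: {v}" for k, v in zip(header, row)) for row in body
    )


def render_header_row_and_column(grid: Grid) -> str:
    """First row and first column are K; each row becomes row K, K1: V1, ..."""
    header, *body = grid
    lines = []
    for row_key, *values in body:
        fields = ", ".join(f"{k}: {v}" for k, v in zip(header[1:], values))
        lines.append(f"{row_key}, {fields}")
    return "\n".join(lines)


def detect_layout(is_header: Mask) -> str:
    """Pick a layout name from the header mask; 'unknown' if none fits."""
    first_all_k = all(is_header[0])
    rest = is_header[1:]
    if first_all_k and rest and all(not any(row) for row in rest):
        return "header_row"
    if first_all_k and rest and all(row[0] and not any(row[1:]) for row in rest):
        return "header_row_and_column"
    if all(flag == (i % 2 == 0) for row in is_header for i, flag in enumerate(row)):
        return "kv_pairs"
    return "unknown"
```

A caller would use `detect_layout` on the mask and then dispatch to the matching `render_*` function, falling back to plain row-by-row text for `"unknown"`.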
The second aspect is extracting the target object from the text, which is essentially a keyword extraction task.
When the target object is extracted from the plain text, the problem is that target-object terms are very widely dispersed in their distribution. To address this, a batch of target-object vocabularies for various industries is specially collected and organized, and the target objects in the training data are normalized against these vocabularies to reduce the sparsity of the distribution. In the prediction stage, the target object extracted by the model is compared with the target-object vocabulary using a maximum matching rule, so that the extracted target object is expanded or trimmed.
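The application does not spell out the maximum matching rule. One possible reading, sketched below under that assumption, is to replace the model's predicted span with the vocabulary term that occurs in the text and overlaps the span the most, preferring the longer term on ties. The function name and the character-offset interface are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def normalize_with_vocabulary(text: str, start: int, end: int,
                              vocabulary: Iterable[str]) -> str:
    """Expand or trim the predicted span text[start:end] to a vocabulary term."""
    best: Optional[Tuple[int, str]] = None  # (overlap length, term)
    for term in vocabulary:
        if not term:
            continue
        pos = text.find(term)
        while pos != -1:
            term_end = pos + len(term)
            overlap = min(end, term_end) - max(start, pos)
            better = best is None or overlap > best[0] or (
                overlap == best[0] and len(term) > len(best[1])
            )
            if overlap > 0 and better:
                best = (overlap, term)
            pos = text.find(term, pos + 1)
    # Keep the model's own span when no vocabulary term overlaps it.
    return best[1] if best else text[start:end]
```

For example, a model span that stops in the middle of a word is widened to the full vocabulary phrase containing it, and a span that overshoots a known term is cut back to that term.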
The mainstream approaches to keyword extraction are rule-based methods, statistics-based methods and machine learning-based methods.
A rule-based method manually defines rules for textual expressions, and information matching those rules is extracted. Regular expressions are generally used. The advantage is that expressions matching a specific rule are reliably extracted or reliably excluded. The disadvantage is that the forms an expression may take are unpredictable, so a large number of rules must be written and summarized by hand, and the rules may conflict with one another. For target objects, recall is quite low because of the diversity of their expressions.
A statistics-based method calculates the weight of each word in a document from statistical information, such as word frequency, and extracts keywords according to the ranking of the weights. The main implementations are TF-IDF [1] and TextRank [2]. The TF-IDF method obtains a word's weight from its Term Frequency (TF) and Inverse Document Frequency (IDF); the TextRank method is based on the idea of PageRank, constructing a co-occurrence network through a word co-occurrence window and calculating word scores. These methods are simple, easy to implement and widely applicable. However, a statistics-based method must be combined with other processing to solve the target-object extraction problem: for example, sentences that may contain a target object must first be located, and extraction is then performed within that set of sentences using part-of-speech tagging. The step of locating candidate sentences does not necessarily give reliable results, so it may produce unwanted results or miss useful information.
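As a brief illustration of statistics-based weighting, the sketch below computes TF-IDF over already tokenized documents, using the common definitions tf = count / document length and idf = log(N / document frequency). Other TF-IDF variants exist, and the function names are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: weight} mapping per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(weights: Dict[str, float], k: int = 3) -> List[str]:
    """Highest-weighted words first; ties are broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]
```

Words that appear in every document receive a weight of zero, which is why generic announcement vocabulary drops out of the ranking.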
Machine learning-based methods include supervised learning methods such as SVM and naive Bayes, and unsupervised learning methods such as K-means and hierarchical clustering. In such methods the quality of the model depends on feature extraction, and deep learning is an effective means of feature extraction. The Word2Vec word vector model, introduced by Google, is a representative learning tool in the field of natural language processing. While training a language model, it maps the dictionary into a more abstract vector space in which each word is represented by a high-dimensional vector, and the distance between two points in the vector space corresponds to the degree of similarity of the two words. The better-performing BERT model, later also introduced by Google, can be used for training and keyword extraction in this embodiment. Machine learning-based methods currently give the best results. However, the data used to train BERT overlaps little with the distribution of target objects in the bidding field, and the distribution of target objects is itself sparse, so performance on target objects is poor. In particular, boundary errors occur, with extra or missing characters, so that the extracted target object is not a well-formed phrase, not a real good or service, or not faithful to the goods or services in the original announcement.
In this embodiment, a named entity recognition method from deep learning is adopted to extract the target object. A common model for named entity recognition is BERT (Bidirectional Encoder Representations from Transformers) + LSTM (Long Short-Term Memory) + CRF (Conditional Random Field). BERT serves as a general-purpose pre-trained model and plays the role of the embedding layer, giving the characters a highly expressive representation in the model. BERT performs well in various downstream tasks, but its large resource requirements make it costly for large-scale use. The LSTM can capture the sequential relations between preceding and following text and is well suited to sequence problems, but it cannot be parallelized and its performance is somewhat lacking. The CRF models the final output prediction sequence and ensures that the predicted sequence is valid.
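How a CRF layer keeps the predicted tag sequence valid can be illustrated with Viterbi decoding over per-position emission scores and tag-to-tag transition scores. This is a generic sketch, not the model of the embodiment: the scores would come from a trained network, and the function name and the plain nested-list interface are assumptions.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers: List[List[int]] = []
    for emit in emissions[1:]:
        best_prev, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for ending at tag j on this position.
            i_best = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            best_prev.append(i_best)
            new_score.append(score[i_best] + transitions[i_best][j] + emit[j])
        backpointers.append(best_prev)
        score = new_score
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for best_prev in reversed(backpointers):
        path.append(best_prev[path[-1]])
    return path[::-1]
```

With tags ordered as O, B, I and a strongly negative score for the O to I transition, the decoder avoids sequences in which an entity continues without having started, even when the per-position scores alone would prefer them.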
In this embodiment, an ELECTRA + Transformer Encoder + CRF model is used. ELECTRA is a variant of BERT that uses a training structure similar to that of generative adversarial networks to balance model size against performance, and is better suited to large-scale use on massive data. The Transformer Encoder is the encoder part of the Transformer structure and is used to increase the fitting capacity of the model.
This embodiment also adds enhancement with target-object vocabulary: a certain number of target-object terms found through error analysis are added directly to the ELECTRA vocabulary, to avoid the incorrect phrase boundaries caused by the sparsity of target-object terms.
In the ELECTRA + Transformer Encoder + CRF model of this embodiment, ELECTRA acts as the embedding module: it maps the text into a meaningful matrix, which is passed to the next layer, the Transformer, of which only the Encoder part is used. The result is then passed to the next layer, the CRF, whose output indicates which characters are joined into strings as the target object.
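Turning the tag sequence produced by the CRF into target-object strings can be sketched as follows. The BIO tagging scheme and the `TARGET` label name are assumptions, since the application does not name the tag set; characters are joined without a separator, which suits Chinese text.

```python
from typing import List


def decode_bio(tokens: List[str], tags: List[str],
               label: str = "TARGET", joiner: str = "") -> List[str]:
    """Join consecutive B-/I- tagged tokens into target-object strings."""
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{label}":
            if current:                    # a new entity starts right after another
                spans.append(joiner.join(current))
            current = [token]
        elif tag == f"I-{label}" and current:
            current.append(token)          # continue the open entity
        else:
            if current:                    # an O tag or a stray I- tag closes the entity
                spans.append(joiner.join(current))
            current = []
    if current:
        spans.append(joiner.join(current))
    return spans
```

For whitespace-delimited languages, passing `joiner=" "` keeps the words of a multi-word target separated.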
In this embodiment, a deep learning classification model is used to find the header information in an HTML table, and an algorithm is then used to parse the table into correct text. In this embodiment, the target-object vocabulary is further used to normalize the target object extracted by the deep learning model.
In this embodiment, an electronic device is provided, comprising a memory in which a computer program is stored and a processor configured to run the computer program to perform the method in the above embodiments.
The programs described above may be run on a processor, or may be stored in memory (also referred to as computer-readable media), which includes permanent and non-permanent, removable and non-removable media that can store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks, and corresponding steps may be implemented by different modules.
This embodiment provides such an apparatus or system. The apparatus is called an automatic extraction processing apparatus for a bidding announcement, and comprises: a grabbing module, configured to grab a webpage according to a webpage address; an extraction module, configured to extract text content related to the webpage from the webpage information, wherein the text content is extracted according to the corresponding tags in the HTML used by the webpage, the text content is the content displayed in the webpage, and the text content is obtained by splicing the texts corresponding to the tags in the order in which the tags appear in the webpage source code; a first determining module, configured to determine, from the text content, that the text content is a bidding announcement; and a second determining module, configured to obtain a keyword from the text content and determine the corresponding target object in the bidding announcement according to the keyword, wherein the keyword is used to indicate the target object and is preconfigured.
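The keyword-based step of the second determining module can be read in several ways. The following minimal sketch assumes one reading: each preconfigured keyword is a field label that precedes the target object on the same line. The keyword list, the "label: value" layout and the function name are hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical preconfigured keywords that indicate the target object.
KEYWORDS: List[str] = ["Subject matter", "Procurement content", "Project name"]


def find_target(text: str, keywords: List[str] = KEYWORDS) -> Optional[str]:
    """Return the text following the first configured keyword found, or None."""
    for keyword in keywords:
        # Accept both ASCII and full-width colons after the keyword.
        pattern = re.escape(keyword) + r"\s*[:：]\s*(.+)"
        match = re.search(pattern, text)
        if match:
            return match.group(1).strip()
    return None


announcement = "Bidding announcement\nProcurement content: 3 excavators\nBudget: 100"
target = find_target(announcement)
```

Keywords are tried in list order, so placing the most specific label first gives it priority when several labels appear in the same announcement.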
The system or the apparatus is used for implementing the functions of the method in the foregoing embodiments, and each module in the system or the apparatus corresponds to each step in the method, which has been described in the method and is not described herein again.
For example, the grabbing module is configured to: receive at least one address of a website or webpage configured by a user; and grab the webpage from the at least one address at a preset period.
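A minimal sketch of periodic grabbing, using only the Python standard library, is given below. The function names, the fixed number of rounds and the error handling are assumptions made for illustration; the fetch and sleep functions are passed in as parameters so that the schedule can be exercised without network access.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes of one webpage."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl_periodically(
    addresses: Iterable[str],
    period_seconds: float,
    rounds: int,
    fetcher: Callable[[str], bytes] = fetch,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, bytes]:
    """Grab every configured address once per period, for a fixed number of rounds."""
    pages: Dict[str, bytes] = {}
    address_list = list(addresses)
    for round_index in range(rounds):
        for url in address_list:
            try:
                pages[url] = fetcher(url)  # keep the most recent copy
            except OSError:
                continue  # skip addresses that are unreachable in this round
        if round_index < rounds - 1:
            sleep(period_seconds)
    return pages


# Usage with stand-in fetch and sleep functions (no network access needed).
waits = []
pages = crawl_periodically(
    ["http://a.example", "http://b.example"],
    period_seconds=60,
    rounds=2,
    fetcher=lambda url: url.encode(),
    sleep=waits.append,
)
```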
Text content may be extracted according to the type of tag. For example: first tag content, which indicates text content in the HTML language, is acquired; the text in the first tag content is extracted; second tag content, which indicates the format of the text, is acquired; and the format of the text is set according to the second tag content to obtain the text content.
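The tag-based extraction described above can be sketched with the standard-library `html.parser` module. The choice of which tags count as format tags (here, tags that introduce a line break) and which are skipped because they are not displayed is an assumption; texts are spliced in the order in which the tags appear in the source.

```python
import re
from html.parser import HTMLParser
from typing import List

BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}  # assumed format tags
SKIP_TAGS = {"script", "style"}  # content that is not displayed in the page


class TextExtractor(HTMLParser):
    """Collect displayed text in source order, breaking lines at block-level tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            # Collapse source whitespace so only tag-driven line breaks remain.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [line.strip() for line in "".join(parser.parts).split("\n")]
    return "\n".join(line for line in lines if line)


page = (
    "<html><head><style>p{}</style></head><body>"
    "<p>Bid <b>announcement</b></p><script>var x;</script>"
    "<div>Budget: 100</div></body></html>"
)
text = extract_text(page)
```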
For another example, a third tag indicating a table in the webpage is acquired; the contents of the rows and columns indicated in the third tag are extracted and arranged, row by row and column by column, into text content in a predetermined format, where the predetermined format is configured in advance.
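A minimal sketch of the table handling follows. It assumes that the first row of the table is the header row and that the predetermined format is "header: value" pairs joined by semicolons, one line per data row; both are illustrative assumptions, and merged cells (`rowspan`, `colspan`) and nested tables are not handled.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableParser(HTMLParser):
    """Collect the cell texts of each table row, in source order."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None
        self._cell: Optional[List[str]] = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Join the cell's text fragments and collapse whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_text(html: str) -> str:
    """Render each data row as 'header: value' pairs (assumed output format)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = []
    for row in body:
        pairs = [f"{name}: {value}" for name, value in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)


table_html = (
    "<table><tr><th>Item</th><th>Quantity</th></tr>"
    "<tr><td>Excavator</td><td>3</td></tr>"
    "<tr><td>Road roller</td><td>1</td></tr></table>"
)
table_text = table_to_text(table_html)
```

Repeating the header name next to every value keeps each output line semantically complete on its own, which is the property the plain-text conversion is meant to preserve.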
For another example, the extraction module is configured to: acquire a fourth tag indicating that a file is embedded in the webpage; and obtain the type of the file from the fourth tag and call the text extraction tool corresponding to that file type to extract the text content from the file.
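The dispatch on file type can be sketched as a registry that maps a file type to an extraction function. The tags treated as file-embedding tags, the use of the file extension to determine the type, and the registry contents are all assumptions; only a plain-text extractor is registered here, and extractors for formats such as PDF or DOCX would wrap external tools that the embodiment does not name.

```python
import os
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urlparse


class EmbeddedFileFinder(HTMLParser):
    """Collect URLs of files referenced by <a>, <embed>, <iframe> or <object> tags."""

    ATTR_BY_TAG = {"a": "href", "embed": "src", "iframe": "src", "object": "data"}

    def __init__(self) -> None:
        super().__init__()
        self.files: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTR_BY_TAG.get(tag)
        if wanted:
            value = dict(attrs).get(wanted)
            if value:
                self.files.append(value)


def file_type(url: str) -> str:
    """Derive the file type from the extension of the URL path (an assumption)."""
    return os.path.splitext(urlparse(url).path)[1].lower().lstrip(".")


def extract_txt(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


# Hypothetical registry of extraction tools, keyed by file type.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {"txt": extract_txt}


def extract_embedded(url: str, data: bytes) -> str:
    """Call the extraction tool registered for the file's type."""
    extractor = EXTRACTORS.get(file_type(url))
    if extractor is None:
        raise ValueError(f"no extraction tool registered for {url!r}")
    return extractor(data)


finder = EmbeddedFileFinder()
finder.feed('<a href="/files/notice.PDF?x=1">download</a><embed src="a.txt">')
```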
In this embodiment, the bidding announcement receives special processing during conversion from the HTML webpage to plain text, so that the resulting plain text makes it easier for a computer to extract the target object from the page. In particular, because table information is specially processed, the converted text retains semantically complete information; and because the target object is standardized and unified, it is less prone to errors, omissions, and extra or missing characters.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. An automatic extraction processing method of a bidding announcement is characterized by comprising the following steps:
capturing a webpage according to the webpage address;
extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code;
determining the text content as a bidding announcement from the text content;
and acquiring keywords from the text content and determining a corresponding target object in the bid announcement according to the keywords, wherein the keywords are used for indicating the target object, and the keywords are configured in advance.
2. The method of claim 1, wherein capturing the webpage according to the webpage address comprises:
receiving at least one address of a website or webpage configured by a user;
and grabbing the webpage according to the at least one address according to a preset period.
3. The method of claim 1, wherein extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page comprises:
acquiring first tag content used for indicating text content in the HTML language;
extracting texts in the first label content;
acquiring second label content for indicating the format of the text;
and setting the format of the text according to the second label content to obtain the text content.
4. The method according to any one of claims 1 to 3, wherein extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page comprises:
acquiring a third label used for indicating a table in the webpage;
extracting contents in the rows and the columns according to the indicated rows and columns in the third label, and arranging the contents into text contents in a predetermined format according to the rows and the columns, wherein the predetermined format is configured in advance.
5. The method according to any one of claims 1 to 3, wherein extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page comprises:
acquiring a fourth label used for indicating that a file is embedded in the webpage;
and acquiring the type of the file in the fourth label, and calling a corresponding character extraction tool according to the type of the file to extract the text content from the file.
6. An automatic extraction processing apparatus of a bidding announcement, comprising:
the grabbing module is used for grabbing the webpage according to the webpage address;
the extraction module is used for extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code;
the first determining module is used for determining the text content as a bidding announcement from the text content;
a second determining module, configured to obtain a keyword from the text content and determine a corresponding target object in the bid announcement according to the keyword, where the keyword is used to indicate the target object, and the keyword is preconfigured.
7. The apparatus of claim 6, wherein the grabbing module is configured to:
receiving at least one address of a website or webpage configured by a user;
and grabbing the webpage according to the at least one address according to a preset period.
8. The apparatus of claim 6, wherein the extraction module is configured to:
acquiring first tag content used for indicating text content in the HTML language;
extracting texts in the first label content;
acquiring second label content for indicating the format of the text;
and setting the format of the text according to the second label content to obtain the text content.
9. The apparatus of any one of claims 6 to 8, wherein the extraction module is configured to:
acquiring a third label used for indicating a table in the webpage;
extracting contents in the rows and the columns according to the indicated rows and columns in the third label, and arranging the contents into text contents in a predetermined format according to the rows and the columns, wherein the predetermined format is configured in advance.
10. The apparatus of any one of claims 6 to 8, wherein the extraction module is configured to:
acquiring a fourth label used for indicating that a file is embedded in the webpage;
and acquiring the type of the file in the fourth label, and calling a corresponding character extraction tool according to the type of the file to extract the text content from the file.
CN202111017828.4A 2021-08-31 2021-08-31 Automatic extraction processing method and device for bid announcement Active CN113704667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111017828.4A CN113704667B (en) 2021-08-31 2021-08-31 Automatic extraction processing method and device for bid announcement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111017828.4A CN113704667B (en) 2021-08-31 2021-08-31 Automatic extraction processing method and device for bid announcement

Publications (2)

Publication Number Publication Date
CN113704667A true CN113704667A (en) 2021-11-26
CN113704667B CN113704667B (en) 2023-06-27

Family

ID=78658432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111017828.4A Active CN113704667B (en) 2021-08-31 2021-08-31 Automatic extraction processing method and device for bid announcement

Country Status (1)

Country Link
CN (1) CN113704667B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114462393A (en) * 2022-04-12 2022-05-10 安徽数智建造研究院有限公司 Webpage text information extraction method and device, terminal equipment and storage medium
CN114648393A (en) * 2022-05-19 2022-06-21 四川隧唐科技股份有限公司 Data mining method, system and equipment applied to bidding
CN115730121A (en) * 2022-11-14 2023-03-03 百思特管理咨询有限公司 Bidding information capture method based on software robot

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053154A1 (en) * 2004-09-09 2006-03-09 Takashi Yano Method and system for retrieving information based on manually-input keyword and automatically-selected keyword
WO2008036528A1 (en) * 2006-09-22 2008-03-27 Microsoft Corporation Recommending keywords based on bidding patterns
US20130073260A1 (en) * 2010-04-20 2013-03-21 Shunji Maeda Method for anomaly detection/diagnosis, system for anomaly detection/diagnosis, and program for anomaly detection/diagnosis
CN105718580A (en) * 2016-01-25 2016-06-29 江苏国泰新点软件有限公司 Method and device for providing bidding information search service
CN108229990A (en) * 2016-12-14 2018-06-29 北京奇虎科技有限公司 A kind of advertisement title generation method, device and equipment
CN108427721A (en) * 2018-02-08 2018-08-21 湖南慧集网络科技有限责任公司 A kind of standardized method of the information on bidding based on database and system
CN109615469A (en) * 2018-12-05 2019-04-12 贵阳高新数通信息有限公司 The management system and method extracted based on bidding website relevant information
CN110502680A (en) * 2019-08-27 2019-11-26 重庆大司空信息科技有限公司 A kind of abstracting method and device of acceptance of the bid bulletin relevant field
CN111095308A (en) * 2017-05-14 2020-05-01 数字推理***有限公司 System and method for quickly building, managing and sharing machine learning models
CN111506795A (en) * 2020-04-20 2020-08-07 北京中电普华信息技术有限公司 Bidding information acquisition method and device
CN111553779A (en) * 2020-06-04 2020-08-18 南京鑫智链科技信息有限公司 Method and device for sorting bid winning candidates, bid inviting terminal and storage medium
CN112667878A (en) * 2020-12-31 2021-04-16 平安国际智慧城市科技股份有限公司 Webpage text content extraction method and device, electronic equipment and storage medium
CN112685620A (en) * 2020-12-31 2021-04-20 山东奥邦交通设施工程有限公司 Bidding information processing method, system, readable storage medium and device
CN112906385A (en) * 2021-05-06 2021-06-04 平安科技(深圳)有限公司 Text abstract generation method, computer equipment and storage medium
CN113128218A (en) * 2021-04-27 2021-07-16 华世界数字科技(深圳)有限公司 Key field extraction method and device for bidding information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
杜海舟;陈政波;钟孔露;: "基于上下文关系和TextRank算法的关键词提取方法", 上海电力学院学报, no. 06, pages 96 - 101 *
邱均平;王曰芬;颜端武;: "内容分析法研究与发展综述", 情报学进展, no. 00, pages 6 - 50 *

Also Published As

Publication number Publication date
CN113704667B (en) 2023-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant