CN113704667A

CN113704667A - Automatic extraction processing method and device for bidding announcement

Info

Publication number: CN113704667A
Application number: CN202111017828.4A
Authority: CN
Inventors: 姚从磊; 陈浩
Original assignee: Beijing Bailian Intelligent Technology Co ltd
Current assignee: Beijing Bailian Intelligent Technology Co ltd
Priority date: 2021-08-31
Filing date: 2021-08-31
Publication date: 2021-11-26
Anticipated expiration: 2041-08-31
Also published as: CN113704667B

Abstract

The application discloses a method and a device for automatically extracting and processing a bidding announcement. The method comprises: capturing a webpage according to a webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to corresponding tags in the HTML (HyperText Markup Language) used by the webpage, the text content is content displayed in the webpage, and the text content is obtained by splicing the texts corresponding to the tags in the order in which the tags appear in the webpage source code; determining, from the text content, that the text content is a bidding announcement; and acquiring keywords from the text content and determining the corresponding target object in the bidding announcement according to the keywords, wherein the keywords indicate the target object and are configured in advance. The method and the device solve the prior-art problem that bidding announcements must be acquired manually, thereby improving the efficiency of acquiring bidding announcements and preventing announcements from being missed.

Description

Automatic extraction processing method and device for bidding announcement

Technical Field

The application relates to the field of webpage text processing, in particular to an automatic extraction processing method and device for a bidding announcement.

Background

A bidding announcement is generally issued, at the various stages of a bidding activity, by the government agencies, enterprises, public institutions, intermediary agencies and other parties taking part in the bidding process, either on the issuer's own website or on a dedicated third-party media website, and discloses key events and important information in the bidding activity. Tens of thousands of websites across the country issue bidding announcements, and on average more than 10 million announcements are released per day.

The target object is a commodity or a service that a purchaser or tenderer needs to procure in a bidding activity, and is information of high commercial value in a bidding announcement. Target objects are extremely varied and are expressed in many different ways, including many specialized terms from specific fields. It is therefore valuable to find, quickly and accurately, the specific target objects that a user is interested in among the numerous and complicated target objects in a massive volume of bidding announcements.

At present, bidding announcements are obtained manually, which consumes a large amount of human resources and easily leads to announcements being missed.

Disclosure of Invention

The embodiment of the application provides an automatic extraction processing method and device of a bidding announcement, and aims to at least solve the problem caused by the need of manually acquiring the bidding announcement in the prior art.

According to an aspect of the present application, there is provided an automatic extraction processing method of a bid announcement, including: capturing a webpage according to the webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; determining the text content as a bidding announcement from the text content; and acquiring keywords from the text content and determining a corresponding target object in the bid announcement according to the keywords, wherein the keywords are used for indicating the target object, and the keywords are configured in advance.

Further, crawling the web page according to the web page address comprises: receiving at least one address of a website or webpage configured by a user; and grabbing the webpage according to the at least one address according to a preset period.

Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring first tag content used for indicating text content in the HTML language; extracting texts in the first label content; acquiring second label content for indicating the format of the text; and setting the format of the text according to the second label content to obtain the text content.

Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring a third tag used for indicating a table in the webpage; extracting the contents of the rows and columns indicated in the third tag, and arranging the contents by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.

Further, extracting the text content related to the web page according to the corresponding tag in the HTML language used by the web page includes: acquiring a fourth label used for indicating that a file is embedded in the webpage; and acquiring the type of the file in the fourth label, and calling a corresponding character extraction tool according to the type of the file to extract the text content from the file.

According to another aspect of the present application, there is also provided an automatic extraction processing apparatus of a bid notice, including: the grabbing module is used for grabbing the webpage according to the webpage address; the extraction module is used for extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; the first determining module is used for determining the text content as a bidding announcement from the text content; a second determining module, configured to obtain a keyword from the text content and determine a corresponding target object in the bid announcement according to the keyword, where the keyword is used to indicate the target object, and the keyword is preconfigured.

Further, the grasping module is configured to: receiving at least one address of a website or webpage configured by a user; and grabbing the webpage according to the at least one address according to a preset period.

Further, the extraction module is configured to: acquiring first tag content used for indicating text content in the HTML language; extracting texts in the first label content; acquiring second label content for indicating the format of the text; and setting the format of the text according to the second label content to obtain the text content.

Further, the extraction module is configured to: acquire a third tag used for indicating a table in the webpage; extract the contents of the rows and columns indicated in the third tag, and arrange the contents by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.

Further, the extraction module is configured to: acquiring a fourth label used for indicating that a file is embedded in the webpage; and acquiring the type of the file in the fourth label, and calling a corresponding character extraction tool according to the type of the file to extract the text content from the file.

In the embodiment of the application, the method comprises the steps of grabbing a webpage according to a webpage address; extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; determining the text content as a bidding announcement from the text content; and acquiring keywords from the text content and determining a corresponding target object in the bid announcement according to the keywords, wherein the keywords are used for indicating the target object, and the keywords are configured in advance. Through the method and the device, the problem that the bidding announcement needs to be acquired manually in the prior art is solved, so that the acquisition efficiency of the bidding announcement is improved, and the bidding announcement is prevented from being missed.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:

fig. 1 is a flowchart of an automatic extraction processing method of a bid announcement according to an embodiment of the present application.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

In the present embodiment, an automatic extraction processing method of a bidding announcement is provided, and fig. 1 is a flowchart of an automatic extraction processing method of a bidding announcement according to an embodiment of the present application, and as shown in fig. 1, the flowchart includes the following steps:

step S102, capturing a webpage according to a webpage address;

in this step, at least one address of a website or web page configured by a user may be received; and grabbing the webpage according to the at least one address according to a preset period.
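The periodic capture described in this step can be sketched as follows. This is a minimal illustration only: the function names `fetch_page`, `crawl_once` and `crawl_periodically`, the UTF-8 decoding and the skip-on-failure policy are assumptions made for the sketch and are not specified in the application.

```python
import time
from typing import Callable, Dict, Iterable, List
from urllib.request import urlopen


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download one page and decode it as text (UTF-8 is assumed here)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_once(addresses: Iterable[str],
               fetch: Callable[[str], str] = fetch_page) -> Dict[str, str]:
    """Fetch every user-configured address once; failed addresses are skipped."""
    pages = {}
    for url in addresses:
        try:
            pages[url] = fetch(url)
        except Exception:  # network error, invalid address, and so on
            continue
    return pages


def crawl_periodically(addresses: List[str], period_seconds: float, rounds: int,
                       fetch: Callable[[str], str] = fetch_page,
                       sleep: Callable[[float], None] = time.sleep) -> List[Dict[str, str]]:
    """Repeat crawl_once `rounds` times, waiting `period_seconds` between rounds."""
    results = []
    for i in range(rounds):
        results.append(crawl_once(addresses, fetch))
        if i < rounds - 1:
            sleep(period_seconds)
    return results
```

Passing `fetch` and `sleep` as parameters keeps the scheduling logic separate from network access, so the preset period can be exercised without contacting any website.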

Step S104, extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code;

step S106, determining the text content as a bidding announcement from the text content;

This step may also be implemented by machine learning. A third machine learning model may be trained using multiple sets of third training data, where each set of third training data includes input data and output data, the input data being a text content and the output data being a label identifying whether that text content is a bidding announcement. After training, the text content is input into the third machine learning model, which outputs whether the text content is a bidding announcement; if so, step S108 is executed.

Step S108, obtaining keywords from the text content and determining corresponding target objects in the bid announcement according to the keywords, wherein the keywords are used for indicating the target objects, and the keywords are configured in advance.

This step may also be implemented by machine learning. A second machine learning model may be trained using multiple sets of second training data, where each set of second training data includes input data and output data, the input data being a text content and the output data being a target object. After training, the text content is input into the second machine learning model, which outputs the target object.

In another alternative, the keywords in the context of each target object output by the model are acquired and saved, and all previously saved keywords are then searched for in the text content; that is, all saved context keywords serve as the keywords to be acquired in step S108, and the target object can then be found in the bidding announcement according to the acquired keywords.

The target object output by the second machine learning model may be compared with the target object found according to the keywords. If the two are consistent, the target object has been found successfully; if they are inconsistent, the target object found is displayed to the user.
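The comparison of the two results can be sketched as follows. The function name `reconcile_targets`, the whitespace and case normalization, and the shape of the returned dictionary are assumptions chosen for illustration; the application does not specify how consistency is tested.

```python
def _normalize(text: str) -> str:
    """Ignore whitespace and letter case when comparing two candidates."""
    return "".join((text or "").split()).lower()


def reconcile_targets(model_target: str, keyword_target: str) -> dict:
    """Compare the model output with the keyword-based result.

    Consistent results are confirmed; inconsistent results are returned
    together so that they can be displayed to the user for review.
    """
    if _normalize(model_target) and _normalize(model_target) == _normalize(keyword_target):
        return {"status": "confirmed", "target": model_target}
    return {"status": "needs_review", "candidates": [model_target, keyword_target]}
```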

Through the steps, the problem caused by the fact that the bidding announcement needs to be acquired manually in the prior art is solved, so that the acquisition efficiency of the bidding announcement is improved, and the bidding announcement is prevented from being missed.

The text content may be extracted according to the type of tag. For example, first tag content used for indicating text content in the HTML language may be acquired; the text in the first tag content is extracted; second tag content indicating the format of the text is acquired; and the format of the text is set according to the second tag content to obtain the text content.
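A minimal sketch of this kind of tag-driven extraction, using only the Python standard library, is given below. The choice of which tags count as text-carrying blocks (`BLOCK_TAGS`), which are skipped (`SKIP_TAGS`), and the rule that `<br>` and block boundaries become line breaks are assumptions for illustration. Texts are spliced in the order in which the tags appear in the source.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "tr"}  # assumed block-level text tags
SKIP_TAGS = {"script", "style", "noscript"}              # content never shown to the reader


class TextExtractor(HTMLParser):
    """Collect displayed text in source order, keeping line structure."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")  # formatting tag: start a new line

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)  # inline tags such as <b> or <i> are simply ignored


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = [" ".join(line.split()) for line in "".join(parser.parts).split("\n")]
    return "\n".join(line for line in lines if line)
```

For instance, `html_to_text("<p>Budget<br>100</p>")` yields two lines, `Budget` and `100`, rather than the fused string that naive tag deletion would produce.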

For another example, a third tag indicating a table in the web page is acquired; the contents of the rows and columns indicated in the third tag are extracted and arranged by row and column into text content in a predetermined format, wherein the predetermined format is configured in advance.
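The row-and-column extraction can be sketched as follows, again with the standard library only. The class name `TableGridParser`, the separators used as the "predetermined format", and the decision to ignore nested tables and merged cells are assumptions of this sketch.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Read <tr>/<td>/<th> cells into a list of rows (nested tables are not handled)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def table_to_text(html: str, cell_sep: str = " | ", row_sep: str = "\n") -> str:
    """Arrange the cells by row and column in a preconfigured format."""
    parser = TableGridParser()
    parser.feed(html)
    parser.close()
    return row_sep.join(cell_sep.join(row) for row in parser.rows)
```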

In an alternative embodiment, the text content may be extracted from the table using a first machine learning model trained using a plurality of sets of training data, each set of training data including input data and output data, the input data being the source code of an HTML web page including a table, and the output data being text content obtained by arranging the text in the table in a predetermined format. After training, the web page in HTML format is input into the first machine learning model, and the text content is output from the first machine learning model.

For another example, a fourth tag indicating that a file is embedded in the webpage is acquired; the type of the file is acquired from the fourth tag, and a corresponding text extraction tool is called according to the type of the file to extract the text content from the file.
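Selecting an extraction tool by file type can be sketched as a small dispatch table. The registry `EXTRACTORS`, the `register` decorator and the use of the file extension as the "type" are hypothetical choices for this sketch; only a plain-text extractor is shown, and tools for PDF, pictures or Flash would be registered in the same way.

```python
import os
from urllib.parse import urlparse

EXTRACTORS = {}  # maps a lowercase file extension to an extraction function


def register(*extensions):
    """Decorator that registers an extraction tool for one or more file types."""
    def decorator(func):
        for ext in extensions:
            EXTRACTORS[ext] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(data: bytes) -> str:
    return data.decode("utf-8", errors="replace")


def file_type(src: str) -> str:
    """Derive the file type from the address found in the embedding tag."""
    return os.path.splitext(urlparse(src).path)[1].lower()


def extract_embedded_file(src: str, data: bytes) -> str:
    """Call the extraction tool that matches the embedded file's type."""
    extractor = EXTRACTORS.get(file_type(src))
    if extractor is None:
        raise ValueError(f"no extraction tool registered for {file_type(src)!r}")
    return extractor(data)
```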

This is described below in connection with an alternative embodiment.

In this optional embodiment, bidding announcements are mainly published as HTML webpages, while a large number of websites also publish them as PDF files, pictures, Flash and the like. Announcements typically have no fixed format. Differences among announcement writers in wording, computer proficiency and the like, together with differences in the technology and tools of the publishing platforms, lead to great differences in how bidding announcements are worded, the order and logic in which information is presented, the rigor of the data, the degree of standardization of tables, the display format and so on. The present embodiment provides a solution.

The present embodiment can be described from two aspects. The first aspect concerns the text from which the target object is extracted: before the target object can be extracted from a bidding announcement, valid announcement text must first be obtained from announcement files with different typesetting formats. Methods for extracting plain text from an HTML webpage mainly include deleting all HTML tags based on rules, and calling the library of the embodiment to extract the text directly.

The process of this embodiment from the original bidding announcement to plain text is as follows:

When valid text is obtained from HTML, directly deleting the HTML tags causes paragraph information to be lost almost completely. The conversion function provided by the present embodiment achieves relatively correct results for paragraph organization. In addition, announcements commonly use tables to present list information, and there is currently no good, general method for parsing table data in this scenario; poor parsing misaligns the information and makes extraction of the target object difficult and error-prone. The present embodiment greatly improves the extraction of correctly expressed text information.

In this embodiment, the part that converts the HTML of a bidding announcement into text adopts a parser designed specifically for bidding announcements. Its main logic is to parse the DOM tree from top to bottom, with dedicated processing logic defined for particular tags. Tags that do not affect line logic, such as <i> and <b>, are deleted. A <br> tag is processed as a line feed, or as a line feed plus an empty line, depending on the preceding and following tags and the parent node tag. A <table> tag receives special processing: a deep learning method is used to obtain the header keyword information in the table, and the table is then converted, according to its structure, into text that is easy for a computer to process.

One example of extracting plain text from HTML in this embodiment is as follows. First, the HTML text is cleaned. The steps roughly comprise: deleting special characters such as <200d> and \u200d; replacing accented Latin characters with similar unaccented characters using normalize('NFKD', text); deleting emoji characters; converting traditional Chinese characters into simplified Chinese characters; cleaning up non-essential tags in the HTML, such as <script/>, <noscript/>, <style/>, <block/>, …; and repairing erroneous <table/> tags, mainly by correcting missing or redundant <tr/>, <th/> and <td/> tags and wrong colspan and rowspan values.

Then, the elements in the HTML are parsed in sequence by pre-order traversal, with a customized processor for each specific HTML tag: an Imageprocessor for <image/>, which extracts the characters in the picture using OCR; a UlProcessor for <ul/>, which processes the list as a whole; a BrProcessor for <br/>, which uses the surrounding tags to judge whether an additional empty line must be added to mark the division between paragraphs; and a TableProcessor for <table/>, which processes table tags specially so that the converted text is as suitable as possible for computation and processing.
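Part of the cleaning stage can be sketched with the Python standard library as follows. The function names and regular expressions are assumptions for illustration. Note that NFKD normalization by itself only decomposes an accented character into a base character plus combining marks, so the sketch additionally drops the combining marks; emoji removal, traditional-to-simplified conversion and table repair are omitted because they need additional resources.

```python
import re
import unicodedata

# Zero-width characters such as U+200D (zero-width joiner).
ZERO_WIDTH = re.compile("[\u200b-\u200d\ufeff]")

# Non-essential elements removed together with their content.
NON_ESSENTIAL = re.compile(
    r"<(script|noscript|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_accents(text: str) -> str:
    """Decompose with NFKD, then drop the combining marks left behind."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_html(html: str) -> str:
    html = ZERO_WIDTH.sub("", html)
    html = NON_ESSENTIAL.sub("", html)
    return strip_accents(html)
```

A regular expression is adequate here only because `<script>`, `<noscript>` and `<style>` elements are removed wholesale; structural repair of tables would require a real parser.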

The TableProcessor in this embodiment may perform the following steps. First, the tag is parsed and converted into a value object Table, whose main structure is a two-dimensional array corresponding to the cells of the table; each array element holds the characters in the cell, the colspan and rowspan information of the cell, and whether the cell is a merged cell (horizontally or vertically). A deep learning model is then built to determine which cells are headers. The processing is then designed in conjunction with the layout structure of the table itself. Let K denote a header and V denote a non-header, i.e. the value corresponding to a header. Typical arrangements are as follows:

KVKV: this layout falls into two cases. In the first case no special treatment is needed, and the cells are simply spliced in sequence with reasonable separators, such as K: V, K: V. In the second case the table content is divided into a left part and a right part that represent similar attributes of two different organizations; one half is processed first and then the other half, again by K: V splicing.

The first row is all K, and any number of following rows are all V: starting from the second row, each row is organized in turn into K1: V1, K2: V2, ….

The first row and the first column are all K, and the rest are V: the cells are spliced by taking K in the first row as a row key and K in the first column as a field key, in the form: row K, column K1: V1, column K2: V2, ….
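The three arrangements can be sketched as follows, assuming the table has already been parsed into a grid of cell strings and the header cells have already been identified. The function names and the separators are illustrative assumptions. In the third function, the header cell that begins each data row is used as that row's key and the header cells above each column as the field keys, which is one reading of the arrangement described above.

```python
def flatten_kvkv(row):
    """[K, V, K, V, ...] in one row becomes 'K: V, K: V'."""
    return ", ".join(f"{k}: {v}" for k, v in zip(row[0::2], row[1::2]))


def flatten_header_row(grid):
    """First row is all K, every later row is all V: one output line per data row."""
    header, *body = grid
    return [", ".join(f"{k}: {v}" for k, v in zip(header, row)) for row in body]


def flatten_header_row_and_column(grid):
    """First row and first column are K, the rest are V."""
    header, *body = grid
    lines = []
    for row in body:
        row_key, values = row[0], row[1:]
        fields = ", ".join(f"{k}: {v}" for k, v in zip(header[1:], values))
        lines.append(f"{row_key}, {fields}")
    return lines
```

For the left/right variant of the KVKV layout, the same `flatten_kvkv` call would be applied first to the left half of each row and then to the right half.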

The second aspect is extracting the target object from the text, the core of which is keyword extraction.

When the target object is extracted from plain text, the distribution of target object words is highly dispersed. To address this, a batch of target object vocabularies covering various industries is specially collected and organized, and the target objects in the training data are normalized against these vocabularies to reduce sparsity. In the prediction stage, the target object extracted by the model is compared with the target object vocabulary using a maximum matching rule, so as to expand or trim the extracted target object.
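One possible reading of the maximum matching rule is sketched below: the longest vocabulary term that overlaps the span extracted by the model replaces it, which either expands a span that was cut short or trims one that ran too long. The function name `snap_to_vocabulary` and this longest-overlap interpretation are assumptions; the application does not define the rule in detail.

```python
def snap_to_vocabulary(text, start, end, vocabulary):
    """Return the longest vocabulary term overlapping text[start:end].

    If no vocabulary term overlaps the extracted span, the span is returned unchanged.
    """
    best = None
    for term in vocabulary:
        if not term:
            continue
        pos = text.find(term)
        while pos != -1:
            term_end = pos + len(term)
            overlaps = pos < end and start < term_end
            if overlaps and (best is None or len(term) > len(best)):
                best = term
            pos = text.find(term, pos + 1)
    return best if best is not None else text[start:end]
```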

Mainstream keyword extraction schemes include rule-based extraction methods, statistics-based extraction methods, and machine-learning-based extraction methods.

The rule-based method manually defines textual expression rules, and information conforming to a rule is extracted. Regular expressions are generally employed. The advantage is that an expression matching a specific rule is certain to be extracted (or certain to be excluded). The disadvantages are that the forms of expression are unpredictable, so a large number of rules must be written and summarized manually, and the rules may conflict with one another. For target objects, recall is quite low because of the diversity of their expression.
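A small sketch of rule-based extraction with regular expressions follows. The two rules and the English example phrasing are invented for illustration and are not taken from the application; they also show why recall is low, since any wording not anticipated by a rule yields nothing.

```python
import re

# Hypothetical hand-written rules; each captures the target object in a named group.
RULES = [
    re.compile(r"(?:procurement|purchase) of (?P<target>[^,.;\n]+)", re.IGNORECASE),
    re.compile(r"project name:\s*(?P<target>[^\n]+)", re.IGNORECASE),
]


def extract_by_rules(text: str) -> list:
    """Return every target object matched by any rule, in rule order."""
    found = []
    for rule in RULES:
        for match in rule.finditer(text):
            found.append(match.group("target").strip())
    return found
```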

The statistics-based method calculates the weight of each word in a document from statistical information such as word frequency, and extracts keywords according to the weight ranking. The main implementations are TF-IDF [1] and TextRank [2]. The TF-IDF method obtains word weights by calculating the term frequency (TF) and the inverse document frequency (IDF); the TextRank method, based on the idea of PageRank, constructs a co-occurrence network through a word co-occurrence window and calculates word scores. These methods are simple, easy to implement and widely applicable. However, a statistics-based method must be combined with other processing to solve the target object extraction problem: for example, sentences that may contain a target object must first be located, and extraction is then performed within that set of sentences in combination with part-of-speech tagging. The sentence-locating step does not necessarily yield reliable results, and may therefore produce unwanted results or miss useful information.
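The TF-IDF weighting can be sketched as follows, using the common definition in which the term frequency is the share of a word among the tokens of a document and the inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the word. Tokenization is assumed to have been done already, and the function names are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens) or 1
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


def top_keywords(doc_weights, k=3):
    """Rank tokens by weight (ties broken alphabetically) and keep the top k."""
    ranked = sorted(doc_weights.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ranked[:k]]
```

A word that appears in every document, such as "tender" in a corpus of bidding announcements, receives a weight of zero and is therefore never selected as a keyword.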

Machine-learning-based methods include supervised learning methods such as SVM and naive Bayes, and unsupervised learning methods such as K-means and hierarchical clustering. In such methods the quality of the model depends on feature extraction, and deep learning is an effective means of feature extraction. The Word2Vec word vector model introduced by Google is a representative learning tool in the field of natural language. In the course of training a language model it maps the dictionary into a more abstract vector space, in which each word is represented by a high-dimensional vector and the distance between two points corresponds to the degree of similarity of the two words. Google later released the better-performing BERT model, which can be used for training and keyword extraction in this embodiment. Machine-learning-based methods currently give the best results. However, the data used to train the BERT model overlaps little with the distribution of target objects in the bidding field, and the distribution of target objects is itself sparse, so performance on target objects is poor. In particular, boundary errors occur, with extra or missing characters, so that the extracted target object is not a valid phrase, is not a real good or service, or is not faithful to the good or service in the original announcement.

In this embodiment, a named entity recognition method from deep learning is adopted to extract the target object. A common model for named entity recognition is BERT (Bidirectional Encoder Representations from Transformers) + LSTM (Long Short-Term Memory) + CRF (Conditional Random Field). BERT serves as a general pre-trained model and plays the role of the embedding, giving characters strong expressive power in the model. BERT performs well in various downstream tasks, but its large resource requirements make it costly for large-scale use. The LSTM captures the sequential relation between preceding and following text and is well suited to sequence problems, but it cannot be parallelized and its performance is somewhat limited. The CRF models the final output prediction sequence and ensures that the predicted sequence is reasonable.

In this embodiment, an ELECTRA + Transformer Encoder + CRF model is used. ELECTRA is a variant of BERT that is trained with a structure similar to a generative adversarial network; it balances model size against performance and is better suited to large-scale use on massive data. The Transformer Encoder is the encoder part of the Transformer structure and is used to increase the fitting capacity of the model.

This embodiment also adds target object vocabulary enhancement: a certain number of target object words that have been checked for errors are added directly to the vocabulary of ELECTRA, to avoid wrong phrase boundaries caused by the sparsity of target object words.

In the ELECTRA + Transformer Encoder + CRF model, ELECTRA plays the role of the embedding module and is responsible for mapping the text into a meaningful matrix, which is sent to the next layer, the Transformer, of which only the encoder part is used. The encoder output is then sent to the next layer, the CRF, whose result indicates which words are to be concatenated into strings as the target object.
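How a predicted tag sequence is turned into target object strings can be sketched as follows. The B/I/O tagging scheme (beginning, inside, outside) is an assumption of this sketch, since the application does not name the tag set produced by the CRF layer, and the function name `decode_bio` is likewise illustrative.

```python
def decode_bio(tokens, tags):
    """Concatenate tokens tagged B, I, I, ... into target object strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new target object starts here
            if current:
                spans.append("".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the open target object
            current.append(token)
        else:                          # "O", or a stray "I" with nothing open
            if current:
                spans.append("".join(current))
            current = []
    if current:
        spans.append("".join(current))
    return spans
```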

In this embodiment, a deep learning classification model is used to find the header information in an HTML table, and an algorithm then parses the table into correct text. In this embodiment, the target object vocabulary is further used to normalize the target object extracted by the deep learning model.

In this embodiment, an electronic device is provided, comprising a memory in which a computer program is stored and a processor configured to run the computer program to perform the method in the above embodiments.

The programs described above may be run on a processor or may also be stored in memory (also referred to as computer-readable media), which includes both permanent and non-permanent, removable and non-removable media that implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.

These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks, and corresponding steps may be implemented by different modules.

Such an apparatus and system is provided in this embodiment. The device is called an automatic extraction processing device of the bidding bulletin, and comprises: the grabbing module is used for grabbing the webpage according to the webpage address; the extraction module is used for extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to a corresponding label in an HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing texts corresponding to the label according to the sequence of the label appearing in the webpage source code; the first determining module is used for determining the text content as a bidding announcement from the text content; a second determining module, configured to obtain a keyword from the text content and determine a corresponding target object in the bid announcement according to the keyword, where the keyword is used to indicate the target object, and the keyword is preconfigured.

The system or the apparatus is used for implementing the functions of the method in the foregoing embodiments, and each module in the system or the apparatus corresponds to each step in the method, which has been described in the method and is not described herein again.

For example, the grasping module is configured to: receiving at least one address of a website or webpage configured by a user; and grabbing the webpage according to the at least one address according to a preset period.

In this embodiment, bidding announcements receive special processing on the way from the HTML webpage to plain text, so that the resulting plain text makes it easier for a computer to extract the target object from the page. In particular, thanks to the special processing of table information, the converted text retains semantically complete information. The target object is also standardized and unified, making wrong, missing, extra or too few characters less likely.

The above are merely examples of the present application and are not intended to limit it. Those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. An automatic extraction processing method for a bidding announcement, characterized by comprising:

capturing a webpage according to a webpage address;

extracting text content related to the webpage from the webpage information, wherein the text content is extracted according to corresponding tags in the HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing the texts corresponding to the tags in the order in which the tags appear in the webpage source code;

determining, from the text content, that the text content is a bidding announcement;

and acquiring a keyword from the text content and determining a corresponding target object in the bidding announcement according to the keyword, wherein the keyword is used for indicating the target object and is configured in advance.

2. The method of claim 1, wherein capturing the webpage according to the webpage address comprises:

receiving at least one address of a website or webpage configured by a user;

and capturing the webpage at the at least one address according to a preset period.

3. The method of claim 1, wherein extracting the text content related to the webpage according to the corresponding tags in the HTML language used by the webpage comprises:

acquiring first tag content used for indicating text content in the HTML language;

extracting text from the first tag content;

acquiring second tag content used for indicating the format of the text;

and setting the format of the text according to the second tag content to obtain the text content.

4. The method according to any one of claims 1 to 3, wherein extracting the text content related to the webpage according to the corresponding tags in the HTML language used by the webpage comprises:

acquiring a third tag used for indicating a table in the webpage;

and extracting the contents of the rows and columns indicated in the third tag, and arranging the contents, by row and column, into text content in a predetermined format, wherein the predetermined format is configured in advance.

5. The method according to any one of claims 1 to 3, wherein extracting the text content related to the webpage according to the corresponding tags in the HTML language used by the webpage comprises:

acquiring a fourth tag used for indicating that a file is embedded in the webpage;

and acquiring the type of the file from the fourth tag, and calling a corresponding text extraction tool according to the type of the file to extract the text content from the file.

6. An automatic extraction processing apparatus for a bidding announcement, comprising:

a capturing module, configured to capture a webpage according to a webpage address;

an extraction module, configured to extract text content related to the webpage from the webpage information, wherein the text content is extracted according to corresponding tags in the HTML language used by the webpage, the text content is displayed in the webpage, and the text content is obtained by splicing the texts corresponding to the tags in the order in which the tags appear in the webpage source code;

a first determining module, configured to determine, from the text content, that the text content is a bidding announcement;

and a second determining module, configured to obtain a keyword from the text content and determine a corresponding target object in the bidding announcement according to the keyword, wherein the keyword is used to indicate the target object and is configured in advance.

7. The apparatus of claim 6, wherein the capturing module is configured to:

receive at least one address of a website or webpage configured by a user;

and capture the webpage at the at least one address according to a preset period.

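8. The apparatus of claim 6, wherein the extraction module is configured to:

acquire first tag content used for indicating text content in the HTML language;

extract text from the first tag content;

acquire second tag content used for indicating the format of the text;

and set the format of the text according to the second tag content to obtain the text content.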

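9. The apparatus of any one of claims 6 to 8, wherein the extraction module is configured to:

acquire a third tag used for indicating a table in the webpage;

and extract the contents of the rows and columns indicated in the third tag, and arrange the contents, by row and column, into text content in a predetermined format, wherein the predetermined format is configured in advance.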

10. The apparatus of any one of claims 6 to 8, wherein the extraction module is configured to: