CN117454880A - Webpage text verification method, device, equipment and storage medium - Google Patents

Info

Publication number
CN117454880A
Authority
CN
China
Prior art keywords
result
text
dictionary
frequency
correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311619251.3A
Other languages
Chinese (zh)
Inventor
田莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN202311619251.3A
Publication of CN117454880A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a webpage text verification method, device, equipment and storage medium. The method comprises the following steps: acquiring a webpage text to be checked; inputting the webpage text to be checked into a text error correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text error correction model is obtained by iteratively training a first model on a first sample set; correcting errors in the first correction result based on a proper noun dictionary and a high-frequency dictionary in the financial field to obtain a second correction result; and determining a target correction result according to a knowledge graph and a target triplet corresponding to the second correction result. With this technical scheme, the efficiency and the error-correction rate of webpage text proofreading can be improved.

Description

Webpage text verification method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a webpage text verification method, device, equipment and storage medium.
Background
With the rapid development of network technology, web pages have become an indispensable channel for financial institutions to publish and manage information, and users have become accustomed to obtaining information by browsing web pages. Websites published by financial institutions are authoritative, and the public information on them needs to reach the public accurately and in real time. However, because of the huge volume of information, errors in the published web page content are inevitable, which undermines the authority and accuracy of that content.
In the prior art, the website content of financial institutions is generally verified manually. Manual verification is time-consuming and labor-intensive, and because of the huge volume of information it is prone to errors and omissions, so the efficiency of webpage text verification is low and the error-correction rate is low.
Disclosure of Invention
The embodiment of the invention provides a webpage text verification method, device, equipment and storage medium, which can improve the efficiency and the error-correction rate of webpage text verification.
According to one aspect of the invention, there is provided a web page text verification method, including:
acquiring a webpage text to be checked;
inputting the webpage text to be checked into a text correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text correction model is obtained by iteratively training a first model through a first sample set;
correcting the error of the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result;
and determining a target collation result according to the knowledge graph and the target triplet corresponding to the second collation result.
According to another aspect of the present invention, there is provided a web page text verification apparatus including:
The webpage text obtaining module to be verified is used for obtaining the webpage text to be verified;
the first correction result determining module is used for inputting the webpage text to be checked into a text correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text correction model is obtained by iteratively training a first model through a first sample set;
the second correction result determining module is used for correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result;
and the target correction result determining module is used for determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the web page text verification method according to any one of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the web page text verification method according to any one of the embodiments of the present invention when executed.
According to the embodiment of the invention, the webpage text to be checked is acquired; the webpage text to be checked is input into a text error correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text error correction model is obtained by iteratively training a first model on a first sample set; errors in the first correction result are corrected based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result; and a target correction result is determined according to the knowledge graph and the target triplet corresponding to the second correction result, so that the efficiency and the error-correction rate of webpage text proofreading can be improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for verifying web page text in an embodiment of the invention;
FIG. 2 is a schematic diagram of a web page text proofing process based on a Soft-Masked BERT model in an embodiment of the invention;
FIG. 3 is a flow chart of target candidate set generation in an embodiment of the invention;
FIG. 4 is a schematic diagram of a text collation process based on a knowledge graph in an embodiment of the invention;
FIG. 5 is a schematic diagram of a device for verifying text of a web page according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an electronic device in an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be appreciated that prior to using the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed and authorized of the type, usage range, usage scenario, etc. of the personal information related to the present disclosure in an appropriate manner according to the relevant legal regulations.
Example 1
Fig. 1 is a flowchart of a web page text verification method provided by an embodiment of the present invention. The embodiment is applicable to verifying web page text. The method may be performed by the web page text verification device in the embodiment of the present invention, and the device may be implemented in software and/or hardware. As shown in fig. 1, the method specifically includes the following steps:
s110, acquiring the webpage text to be checked.
The webpage text to be checked can be related webpage text of a financial institution.
Specifically, the method for obtaining the text of the webpage to be checked may be: acquiring a financial field webpage to be verified; and carrying out page analysis on the to-be-checked financial field webpage to obtain a to-be-checked webpage text.
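The page analysis step can be illustrated with a minimal sketch. It assumes the page is already available as an HTML string and uses only the Python standard library; the class name TextExtractor and the function extract_page_text are hypothetical names introduced for illustration and do not come from the embodiment.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_page_text(html: str) -> list[str]:
    """Return the non-empty text fragments of an HTML page, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each returned fragment can then be split into sentences and passed on as the webpage text to be checked.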
S120, inputting the webpage text to be checked into a text error correction model to obtain a first check result corresponding to the webpage text to be checked.
The text error correction model is obtained by iteratively training a first model through a first sample set.
Specifically, the way to iteratively train the first model through the first sample set may be: obtaining a first sample set, wherein the first sample set comprises: the error sentence sample and the correct sentence corresponding to the error sentence sample; inputting the error sentence samples in the first sample set into a first model to obtain a prediction correction result; and training parameters of the first model according to the prediction correction result and an objective function generated by a correct sentence corresponding to the error sentence sample until a text error correction model is obtained.
The first proofreading result corresponding to the webpage text to be checked is a proofreading result obtained after the text editing error is corrected.
S130, correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result.
It should be noted that, the first correction result may be corrected based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result, or the webpage text to be corrected may be corrected based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain an error correction result, and then the second correction result is determined according to the error correction result and the first correction result.
The obtaining mode of the proper noun dictionary may be: formulating crawler rules based on the element selector; and crawling proper nouns from the web pages in the financial field according to the crawler rules to obtain a proper noun dictionary.
The method for acquiring the high-frequency dictionary in the financial field can be as follows: obtaining a dataset, wherein the dataset comprises: web page text in the financial field; intercepting the webpage text of the financial field based on a sliding window with a preset length to obtain a character string set, wherein the length of each character string in the character string set is the preset length; acquiring word frequency corresponding to each character string in the character string set; and adding the character strings with word frequency larger than a word frequency threshold value in the character string set into a high-frequency dictionary in the financial field.
Specifically, the first correction result may be corrected based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain the second correction result as follows: checking the first correction result against the proper noun dictionary to obtain the proper nouns with errors in the first correction result; checking the first correction result against the high-frequency dictionary in the financial field to obtain the high-frequency words with errors in the first correction result; determining, according to the target dictionary and the proper nouns and high-frequency words with errors in the first correction result, a target candidate set corresponding to the proper nouns with errors and a target candidate set corresponding to the high-frequency words with errors; and determining the second correction result according to the two target candidate sets, the proper nouns with errors, the high-frequency words with errors, and the first correction result.
S140, determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result.
It should be noted that, the target calibration result may be determined according to the knowledge graph and the target triplet corresponding to the second calibration result, the third calibration result may be determined according to the knowledge graph and the target triplet corresponding to the first calibration result, and then the target calibration result may be determined according to the second calibration result and the third calibration result.
Specifically, the method for determining the target collation result according to the knowledge graph and the target triplet corresponding to the second collation result may be as follows: creating a knowledge graph, wherein the knowledge graph comprises: at least two financial entities, attribute information of each financial entity and a relationship between the at least two financial entities; obtaining a target triplet corresponding to the second correction result; generating a target query statement according to the target triplet and a query statement template for querying the knowledge graph; executing the target query statement to obtain a return result; and if the returned result is error-free, determining the second correction result as a target correction result. The method for determining the target collation result according to the knowledge graph and the target triplet corresponding to the second collation result may further be: acquiring a triplet corresponding to the first checking result; generating a first query statement according to the triples corresponding to the first proofreading result and a query statement template for querying the knowledge graph; executing the first query statement to obtain a return result; and if the returned result is error-free, determining the first correction result and the second correction result as target correction results.
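The triplet check described above can be sketched as follows. This is only an illustration under stated assumptions: the knowledge graph is modelled as an in-memory set of (head, relation, tail) triples, the query template is written in a Cypher-like style although the embodiment does not name a query language, and all entity names, relation names and function names are hypothetical.

```python
# Hypothetical knowledge graph stored as a set of (head, relation, tail) triples.
KNOWLEDGE_GRAPH = {
    ("Fund A", "managed_by", "Manager X"),
    ("Fund A", "issued_by", "Bank Y"),
}

# Hypothetical Cypher-style template; the source does not name a query language.
QUERY_TEMPLATE = "MATCH (h {{name: '{head}'}})-[r:{relation}]->(t) RETURN t.name"


def build_query(triple):
    """Fill the query template from the head entity and relation of a triple."""
    head, relation, _tail = triple
    return QUERY_TEMPLATE.format(head=head, relation=relation)


def run_query(triple, graph=KNOWLEDGE_GRAPH):
    """Stand-in for executing the query: all tails stored for (head, relation)."""
    head, relation, _tail = triple
    return sorted(t for h, r, t in graph if h == head and r == relation)


def check_triple(triple, graph=KNOWLEDGE_GRAPH):
    """Return (is_consistent, expected_tails) for a triple extracted from the text."""
    expected = run_query(triple, graph)
    return triple[2] in expected, expected
```

If check_triple reports a consistent triple, the correction result it was extracted from can be accepted as the target correction result; otherwise the expected tails indicate what the graph holds for that entity and relation.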
The embodiment of the invention analyzes the characteristics of webpage text in the financial field and classifies webpage text errors into two types: text editing errors and knowledge representation errors. A corresponding technical scheme is provided for each type. For text editing errors, a Soft-Masked BERT model is iteratively trained on a self-constructed training corpus to correct them. For knowledge representation errors, a knowledge graph is constructed from the entity and relationship types summarized from webpage text in the financial institution field; the reasoning capability of the knowledge graph is then used to judge the association relationships among entities, which specifically addresses knowledge and semantic errors in the financial institution field. In addition, the consistency between anchor text and the content of the linked text is checked against a threshold value.
Optionally, determining the target collation result according to the knowledge graph and the target triplet corresponding to the second collation result includes:
if the webpage text to be checked comprises hyperlinks, consistency detection is carried out on the webpage title and the text corresponding to the hyperlinks in the webpage, and a consistency detection result is obtained;
and determining a target calibration result according to the knowledge graph, the consistency detection result and the target triplet corresponding to the second calibration result.
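The consistency detection can be sketched as a similarity score compared against a threshold. The embodiment does not specify the similarity measure or the threshold value, so the character-level ratio from the standard library's difflib and the default threshold of 0.6 below are assumptions chosen purely for illustration, as is the function name is_consistent.

```python
from difflib import SequenceMatcher


def is_consistent(anchor_text, page_title, threshold=0.6):
    """Return (passed, score), where score is a character-level similarity ratio.

    The measure and the threshold are illustrative assumptions, not values
    taken from the embodiment.
    """
    score = SequenceMatcher(None, anchor_text, page_title).ratio()
    return score >= threshold, score
```

A hyperlink whose anchor text fails the check would be reported in the consistency detection result alongside the knowledge-graph findings.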
Optionally, iteratively training the first model through the first sample set includes:
obtaining a first sample set, wherein the first sample set comprises: the error sentence sample and the correct sentence corresponding to the error sentence sample;
inputting the error sentence samples in the first sample set into a first model to obtain a prediction correction result;
and training parameters of the first model according to the prediction correction result and an objective function generated by a correct sentence corresponding to the error sentence sample until a text error correction model is obtained.
It should be noted that most web page text is entered manually or through OCR recognition, and errors involving visually similar characters or homophones often occur during input. The embodiment of the invention can use the text error correction model Soft-Masked BERT to correct such spelling errors in text editing.
In a specific example, fig. 2 shows the web text proofreading process based on the Soft-Masked BERT model, which mainly consists of two parts: proofreading model training and proofreading model prediction. For model training, corpus collection, sentence splitting and noise reduction must be completed first. To better adapt to domain-specific text error correction in financial institution webpages, websites related to financial institutions can be crawled and the text data filtered. So that training data related to the financial institution field is included in the final training set, parallel (wrong sentence, correct sentence) pairs are constructed with a replacement-based method: 80% of the sentences are selected for replacement processing, and the resulting sentence pairs are converted into the text format required by the model before training. In the text proofreading stage, the cleaned text to be proofread is used as the model input, the model predicts the result, and the proofreading result is output.
The Soft-Masked BERT model divides the error correction task between an error detection layer and an error correction layer, with the output of the error detection layer serving as input to the error correction layer. Using the output of the error detection layer as a weight, the mask embedding is softly added to the feature of each character, which is the "soft masking" that gives the model its name.
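The replacement-based construction of (wrong sentence, correct sentence) pairs can be sketched as below. The confusion dictionary, the use of Latin letters instead of Chinese characters, the choice to replace exactly one character per selected sentence, and the decision to keep unselected sentences as unchanged pairs are all assumptions made for illustration; only the 80% selection ratio comes from the text.

```python
import random

# Hypothetical confusion dictionary: character -> characters it is easily mistaken for.
CONFUSION = {"a": ["e"], "i": ["l"], "o": ["0"]}


def make_pair(sentence, rng, confusion=CONFUSION):
    """Return (wrong_sentence, correct_sentence) by replacing one confusable character."""
    positions = [i for i, ch in enumerate(sentence) if ch in confusion]
    if not positions:
        return sentence, sentence
    i = rng.choice(positions)
    wrong = sentence[:i] + rng.choice(confusion[sentence[i]]) + sentence[i + 1:]
    return wrong, sentence


def build_sample_set(sentences, replace_ratio=0.8, seed=0):
    """Corrupt replace_ratio of the sentences; keep the rest as unchanged pairs."""
    rng = random.Random(seed)
    n_replace = round(len(sentences) * replace_ratio)
    chosen = set(rng.sample(range(len(sentences)), n_replace))
    return [make_pair(s, rng) if i in chosen else (s, s)
            for i, s in enumerate(sentences)]
```

In practice the confusion dictionary would hold homophones and visually similar characters for the target language, so that the synthetic errors resemble those made during manual or OCR input.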
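A minimal sketch of the soft-masking step follows, assuming the commonly described formulation in which each character embedding e_i is blended with the mask embedding e_mask using the detection probability p_i: e'_i = p_i * e_mask + (1 - p_i) * e_i. Plain Python lists stand in for tensors, and the function name soft_mask is hypothetical.

```python
def soft_mask(char_embeddings, error_probs, mask_embedding):
    """Blend each character embedding with the mask embedding.

    error_probs[i] is the detection layer's probability that character i is wrong:
    a value near 1 pushes the embedding toward the mask embedding, a value near 0
    leaves the original embedding almost unchanged.
    """
    blended = []
    for emb, p in zip(char_embeddings, error_probs):
        blended.append([p * m + (1.0 - p) * e for e, m in zip(emb, mask_embedding)])
    return blended
```

The blended embeddings are what the error correction layer receives, so characters the detector considers suspicious are effectively masked in proportion to that suspicion.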
Optionally, correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result, including:
acquiring a proper noun dictionary and a high-frequency dictionary in the financial field;
identifying the first checking result to obtain proper nouns in the first checking result;
if the proper noun in the first proofreading result exists in the proper noun dictionary, determining that the proper noun in the first proofreading result passes the verification, and if the proper noun in the first proofreading result does not exist in the proper noun dictionary, determining that the proper noun in the first proofreading result has errors;
intercepting the first checking result based on a sliding window with a preset length to obtain a word candidate set to be checked;
sequentially checking whether each high-frequency word in the word candidate set to be checked exists in the high-frequency dictionary in the financial field; if it exists, determining that the high-frequency word passes the verification, and if it does not exist, determining that the high-frequency word has an error;
acquiring the proper nouns and high-frequency words with errors in the first correction result;
editing proper nouns and high-frequency words with errors in the first correction result according to a target dictionary to obtain an initial candidate set corresponding to the proper nouns and an initial candidate set corresponding to the high-frequency words;
screening the initial candidate set corresponding to the proper noun according to the proper noun dictionary to obtain a target candidate set corresponding to the proper noun with errors in the first correction result;
screening the initial candidate set corresponding to the high-frequency word according to the high-frequency dictionary in the financial field to obtain a target candidate set corresponding to the high-frequency word with errors in the first proofreading result;
and determining a second correction result according to the target candidate set corresponding to the proper noun with the error in the first correction result, the target candidate set corresponding to the high-frequency word with the error in the first correction result, the proper noun with the error in the first correction result, the high-frequency word with the error in the first correction result and the first correction result.
Wherein the target dictionary includes a homophone dictionary and a similar-shape dictionary. The homophone dictionary contains words that are pronounced alike but written differently, for example, apple and peace. The similar-shape dictionary contains words whose characters look alike.
Specifically, the proper nouns and high-frequency words with errors in the first correction result may be edited according to the target dictionary as follows: characters are added to, deleted from and modified in the erroneous proper nouns and high-frequency words according to the target dictionary to obtain the initial candidate sets. Alternatively, the editing may be performed as follows: modifying the erroneous proper nouns and high-frequency words according to the homophone dictionary and/or the similar-shape dictionary, and adding the modified proper nouns and high-frequency words to the initial candidate set; deleting repeated characters from the erroneous proper nouns and high-frequency words, and adding the results to the initial candidate set; adding and/or removing characters in the erroneous proper nouns according to the proper noun dictionary to obtain modified proper nouns, and adding and/or removing characters in the erroneous high-frequency words according to the high-frequency dictionary in the financial field to obtain modified high-frequency words; and adding the modified proper nouns and high-frequency words to the initial candidate set.
Specifically, the method for screening the initial candidate set corresponding to the proper noun according to the proper noun dictionary to obtain the target candidate set corresponding to the proper noun with the error in the first calibration result may be: and sequentially judging whether proper nouns in the initial candidate set corresponding to the proper nouns exist in the proper noun dictionary, if so, adding the target candidate set, and if not, deleting.
Specifically, the method of screening the initial candidate set corresponding to the high-frequency word according to the high-frequency dictionary in the financial field to obtain the target candidate set corresponding to the high-frequency word with errors in the first calibration result is similar to that of the proper noun, and will not be described herein.
Specifically, the method for determining the second calibration result according to the target candidate set corresponding to the proper noun with the error in the first calibration result, the target candidate set corresponding to the high-frequency word with the error in the first calibration result, the proper noun with the error in the first calibration result and the high-frequency word with the error in the first calibration result, and the first calibration result may be: replacing proper nouns with errors in the first correction result with proper nouns in the target candidate set, and replacing high-frequency words with errors in the first correction result with high-frequency words in the target candidate set to obtain a second correction result.
In a specific example, correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial domain to obtain a second correction result includes:
the detection of proper noun errors in the first calibration result is performed by the following steps:
step 1: performing named entity recognition on the first correction result. The embodiment of the invention uses a trained BiLSTM+CRF model to label and recognize proper nouns of specified types. For example, the labeling result of "hedging funds" is [B-STR I-STR I-STR E-STR], where B marks the beginning of an entity name, E marks the end of an entity name, and STR denotes an investment strategy entity. The recognized proper nouns form the candidate set of proper nouns to be detected.
Step 2: sequentially checking whether each proper noun in the candidate set exists in the proper noun dictionary; if it exists, determining that the proper noun has no error, and if it does not exist, preliminarily determining that the proper noun has an error.
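The two steps can be sketched as follows, assuming the tagger has already produced one tag per character. The sketch decodes B/I/E tags into entities and then looks each entity up in the proper noun dictionary; the BiLSTM+CRF model itself is not shown, the sample characters are placeholders, and the function names are hypothetical.

```python
def decode_entities(chars, tags):
    """Group characters tagged B-/I-/E- of the same type into (text, type) entities."""
    entities, current, current_type = [], [], None
    for ch, tag in zip(chars, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B":
            current, current_type = [ch], etype
        elif prefix in ("I", "E") and current and etype == current_type:
            current.append(ch)
            if prefix == "E":
                entities.append(("".join(current), current_type))
                current, current_type = [], None
        else:
            # "O" or an inconsistent tag: drop any partial entity.
            current, current_type = [], None
    return entities


def check_proper_nouns(entities, proper_noun_dict):
    """Split entities into those found in the dictionary and those flagged as suspect."""
    passed = [e for e in entities if e[0] in proper_noun_dict]
    suspect = [e for e in entities if e[0] not in proper_noun_dict]
    return passed, suspect
```

Entities in the suspect list are only preliminary error candidates; they are confirmed or corrected in the later candidate-generation stage.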
The detection of the high-frequency word errors in the first checking result comprises the following steps:
step 1: setting a sliding window of length N (N ranges over [2:6]), intercepting words of length N, and adding them to the candidate set to be detected.
Step 2: sequentially checking whether each high-frequency word in the candidate set exists in the high-frequency dictionary; if it exists, no error exists, and if it does not exist, preliminarily determining that the word has an error.
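A minimal sketch of this sliding-window detection is given below. It enumerates every window of length 2 to 6 and flags those absent from the dictionary; the function names and the returned (start, substring) shape are illustrative assumptions.

```python
def window_candidates(text, min_len=2, max_len=6):
    """Yield (start, substring) for every window of length N in [min_len, max_len]."""
    for n in range(min_len, max_len + 1):
        for start in range(len(text) - n + 1):
            yield start, text[start:start + n]


def flag_unknown_windows(text, high_freq_dict, min_len=2, max_len=6):
    """Return windows that are absent from the dictionary (preliminary error candidates)."""
    return [(s, w) for s, w in window_candidates(text, min_len, max_len)
            if w not in high_freq_dict]
```

Because most windows of ordinary text are not dictionary entries, the flagged windows are only preliminary; the candidate-set screening that follows decides whether a correction is actually proposed.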
The high-frequency word and proper noun are checked by the proper noun dictionary and the high-frequency dictionary in the finance field, as shown in fig. 3, for proper nouns with errors in the first checking result, adding, deleting and modifying are needed according to the principle that the editing distance is less than or equal to 2, so as to obtain an initial candidate set, sequentially judging whether proper nouns in the initial candidate set corresponding to the proper nouns exist in the proper noun dictionary, if so, adding a target candidate set, and if not, deleting.
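The generate-then-screen procedure can be approximated with a simpler sketch: instead of enumerating every edit and then filtering, it scans the dictionary and keeps entries whose Levenshtein distance to the erroneous word is at most 2. This is a swapped-in simplification that ignores the homophone and similar-shape dictionaries, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance (insert, delete and substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def target_candidates(wrong_word, dictionary, max_distance=2):
    """Dictionary entries within max_distance edits of wrong_word, nearest first."""
    scored = [(edit_distance(wrong_word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if 0 < d <= max_distance]
```

The first element of the returned list is the nearest dictionary entry and is a natural choice for replacing the erroneous word when the second correction result is produced.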
It should be noted that financial institution websites frequently publish information about business, products, markets, and news, which typically describes entities such as financial institutions and the events that occur between them. Both the proper-noun entities and the relationships between entities must therefore be accurate. Correcting knowledge-expression errors with a dictionary-based method combined with a knowledge-graph-based method can improve both correction efficiency and the error-correction rate.
Optionally, acquiring a high-frequency word dictionary in the financial field includes:
obtaining a dataset, wherein the dataset comprises: web page text in the financial field;
Intercepting the webpage text of the financial field based on a sliding window with a preset length to obtain a character string set, wherein the length of each character string in the character string set is the preset length;
acquiring word frequency corresponding to each character string in the character string set;
and adding the character strings with word frequency larger than a word frequency threshold value in the character string set into a high-frequency dictionary in the financial field.
In a specific example, this embodiment slides a window of length N (N ranges over [2, 6]) across financial-domain web page text, extracts the strings of length N to form a domain dictionary, and counts their frequencies. A string whose frequency reaches the threshold for its length is added to the domain high-frequency dictionary and stored in a file in the form "word:frequency". Longer strings occur less often, so their thresholds are lower: with reference to the literature, the thresholds for lengths 2 through 6 are set to 350, 220, 150, 70, and 40, respectively.
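A minimal sketch of this dictionary construction follows. The per-length thresholds are the values quoted above; the function names, the in-memory corpus, and the strict "greater than" comparison (taken from the optional step listed earlier, whereas the example says the frequency "reaches" the threshold) are assumptions.

```python
from collections import Counter

# Thresholds per string length, as quoted in the description.
THRESHOLDS = {2: 350, 3: 220, 4: 150, 5: 70, 6: 40}


def build_high_freq_dict(texts, thresholds=THRESHOLDS):
    """Count every window of each configured length and keep the frequent ones."""
    counts = Counter()
    for text in texts:
        for n in thresholds:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # Assumption: strictly greater than the threshold for that length.
    return {s: c for s, c in counts.items() if c > thresholds[len(s)]}


def format_entries(dictionary):
    """Render entries as 'word:frequency' lines, sorted by word."""
    return [f"{word}:{freq}" for word, freq in sorted(dictionary.items())]
```

Writing the lines returned by `format_entries` to a text file gives the "word:frequency" storage format mentioned above.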
The embodiment of the invention constructs the high-frequency dictionary in the financial field to expand the coverage range of the dictionary and improve the accuracy of error checking and correcting.
Optionally, acquiring the proper noun dictionary includes:
formulating crawler rules based on the element selector;
And crawling proper nouns from the web pages in the financial field according to the crawler rules to obtain a proper noun dictionary.
It should be noted that the proper-noun vocabulary in financial institution web page text covers banking institutions, enterprises, industry categories, financial products, professions, and so on. Proper nouns are crawled from a suitable knowledge base according to the formulated crawler rules to obtain the proper noun dictionary.
Optionally, obtaining the webpage text to be checked includes:
acquiring a financial field webpage to be verified;
and carrying out page analysis on the to-be-verified financial field webpage to obtain to-be-verified webpage text, webpage titles and hyperlinks in the webpage.
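Page analysis of this kind can be sketched with the Python standard library's `html.parser`. The class name `PageParser`, the helper `parse_page`, and the shape of the returned dictionary are assumptions for illustration; the description does not say which parser the embodiment uses.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the page title, visible text, and (href, anchor text) pairs."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self.text_parts = []
        self._href = None
        self._anchor = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "a":
            if self._href:
                self.links.append((self._href, "".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style content
        if self._in_title:
            self.title += data.strip()
            return
        if self._href:
            self._anchor.append(data)
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title,
            "text": " ".join(parser.text_parts),
            "links": parser.links}
```

The `links` list pairs each hyperlink with its anchor text, which is what the later title and body consistency check consumes.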
Optionally, determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result includes:
creating a knowledge graph, wherein the knowledge graph comprises: at least two financial entities, attribute information of each financial entity and a relationship between the at least two financial entities;
obtaining a target triplet corresponding to the second correction result;
generating a target query statement according to the target triplet and a query statement template for querying the knowledge graph;
executing the target query statement to obtain a return result;
If the returned result is that no error exists, consistency detection is carried out on the webpage title and the text corresponding to the hyperlink in the webpage, and a consistency detection result is obtained;
and generating a target correction result according to the consistency detection result and the second correction result.
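Generating the target query statement from a template can be sketched as follows. The template text, the parameter names `$head` and `$tail`, and the function `build_query` are assumptions; the relationship type is validated and interpolated because Cypher does not accept a relationship type as a query parameter, while the entity names are passed as parameters. Executing the statement requires a graph database client, which is not shown.

```python
import re

# Hypothetical template: relationship type is interpolated, names are parameters.
QUERY_TEMPLATE = (
    "MATCH (a)-[r:`{rel}`]-(b) "
    "WHERE a.name = $head AND b.name = $tail "
    "RETURN type(r) AS RES"
)


def build_query(head, rel, tail):
    """Return (statement, parameters) for one (head, rel, tail) triplet."""
    # Allow only word characters in the relationship type so that the
    # interpolated value cannot alter the structure of the statement.
    if not re.fullmatch(r"\w+", rel):
        raise ValueError(f"unsupported relationship type: {rel!r}")
    return QUERY_TEMPLATE.format(rel=rel), {"head": head, "tail": tail}


statement, params = build_query("Bank A", "ISSUES", "Fund X")
```

An empty result set for the generated statement corresponds to the case in which the relationship is not found in the knowledge graph.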
In a specific example, the embodiment of the present invention provides a knowledge-graph-based text proofreading method for financial institutions. As shown in fig. 4, it consists of two parts overall: building the domain web page knowledge graph, and proofreading text based on the knowledge graph.
In the construction stage of the financial-domain web page knowledge graph, a conceptual schema for the knowledge graph is first established through investigation and data analysis with respect to the domain proofreading goals. Knowledge instances are then acquired according to the domain of the web page text to be proofread, providing knowledge support for subsequent domain text proofreading. Data sources may include structured data, such as business systems and internal domain dictionaries, and semi-structured data, such as domain web page text and encyclopedia knowledge. This embodiment uses knowledge graph queries to correct errors in domain common-sense knowledge. Let (E1, R, E2) denote an extracted triplet. The Cypher query statement S is: MATCH (a)-[:R]-(b) WHERE a.name = 'E1' AND b.name = 'E2' RETURN RES. The result is handled case by case:
(1) If RES = R, there is no error.
(2) If RES is empty, a common-sense error may exist. A DFS graph search is then used for inference: if an intermediate entity Em links E1 and E2, there is no error. Otherwise, no relation labeled R exists between E1 and E2 and a common-sense knowledge error may exist, so a search or inference is performed in the knowledge graph starting from entity E1 and relation label R. If the query reports that E1 or E2 does not exist, a proper-noun error may exist, and entities of the same entity type whose Levenshtein distance from E1 or E2 is 1 can be returned as the proofreading result.
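The DFS inference in case (2) can be sketched on an in-memory adjacency map. The graph representation, the function name `find_linking_path`, and the hop limit `max_hops` are assumptions; in the embodiment the search would run against the knowledge graph itself.

```python
def find_linking_path(graph, e1, e2, max_hops=3):
    """Depth-first search for a path e1 -> Em -> ... -> e2.

    graph maps each entity to the set of its neighbours. A path counts
    only if it passes through at least one intermediate entity; a direct
    edge between e1 and e2 is ignored. Returns the path as a list of
    entities, or None when no such path exists within max_hops edges.
    """
    def dfs(node, path):
        if len(path) - 1 >= max_hops:
            return None  # hop budget exhausted
        for nb in sorted(graph.get(node, ())):
            if nb in path:
                continue  # avoid cycles
            if nb == e2:
                if len(path) >= 2:  # e1 plus at least one intermediate
                    return path + [nb]
                continue
            found = dfs(nb, path + [nb])
            if found:
                return found
        return None

    return dfs(e1, [e1])


graph = {"A": {"M"}, "M": {"A", "B"}, "B": {"M"}, "C": set()}
path = find_linking_path(graph, "A", "B")
```

A non-empty path plays the role of the intermediate entity Em: the triplet is then treated as error-free, while `None` leaves the possible common-sense error standing.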
In addition, when page content is actually written, a hyperlink title may be inconsistent with the content of the linked page because of editorial oversight. The embodiment of the invention uses knowledge extraction techniques to help detect this problem. The overall flow includes:
(1) Analysis shows that most static pages for bulletins, news, product introductions, and operation guides have URLs ending in .htm. The a tags in each page are crawled and filtered to obtain the urls of the specified format in the web page; the corresponding title text1 is extracted and segmented in full mode to obtain Wt = {w1, w2, ..., wn}.
(2) Follow the url; the linked page contains the body text corresponding to the title.
Crawl the body text2 of the linked page, segment it, and extract N keywords from the article using TF-IDF, where N is set dynamically to the number of keywords extracted from the title, to obtain Wc = {w1, w2, ..., wn}.
Process the two segmentation results Wt and Wc. First delete words of two characters or fewer; then apply a custom stop-word list, which stores words that have little bearing on the topic (such as "frequent", "passing", and "appearing"), and filter out the stop words to obtain Wt' and Wc'.
Compare the words in Wt' and Wc' and count the number C of words in Wc' that are not contained in Wt'. The correlation Rel between the title and the body content is then computed from C [formula not reproduced in the source text]. If Rel is less than the set threshold, the title and the body content may be inconsistent.
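The filtering and comparison steps can be sketched as below. The formula for Rel is not available in this text, so the sketch assumes Rel = 1 - C / |Wc'|, that is, the share of body keywords that also appear in the title; this definition, the threshold value 0.5, the English stop words, and all function names are assumptions, not the patent's formula.

```python
# Example stop words taken from the description; the real list is custom.
STOP_WORDS = {"frequent", "passing", "appearing"}


def filter_words(words, stop_words=STOP_WORDS, min_len=3):
    """Drop words of two characters or fewer and words on the stop list."""
    return [w for w in words if len(w) >= min_len and w not in stop_words]


def relevance(title_words, body_keywords):
    """Assumed definition: Rel = 1 - C / |Wc'|.

    C is the number of filtered body keywords (Wc') that do not occur
    among the filtered title words (Wt').
    """
    wt = set(filter_words(title_words))
    wc = filter_words(body_keywords)
    if not wc:
        return 0.0
    c = sum(1 for w in wc if w not in wt)
    return 1 - c / len(wc)


def is_consistent(title_words, body_keywords, threshold=0.5):
    """True when Rel reaches the (assumed) threshold."""
    return relevance(title_words, body_keywords) >= threshold
```

With this definition, a title whose filtered words cover two of three body keywords scores 2/3 and passes a threshold of 0.5, while a title sharing no keywords with the body scores 0 and is flagged.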
When proofreading web page text based on the knowledge graph, the domain web page to be proofread is first parsed and its entities and relations are extracted; the title and Url are combined to check consistency between the anchor text and the linked body content; knowledge query and inference techniques are then used for analysis, and the analysis result determines whether a sentence contains domain vocabulary errors or knowledge errors.
The implementation logic of web page text proofreading for the financial institution domain is designed as follows:
Input: Url of a financial institution
Output: error analysis results (domain knowledge error / title-body inconsistency)
1: design crawling rules and crawl the pages under Url to the specified depth to obtain the set U;
2: store U in an Elasticsearch database;
3: for u in U do
4: obtain the html source code corresponding to u;
5: parse the page to obtain the body text D and the set (T, Url) of title text t and url pairs;
6: extract stated knowledge from D to obtain the set of entity-relation triplets (H, R, T);
7: for (h, r, t) in (H, R, T) do
8: check against the knowledge graph and return domain knowledge errors;
9: end for
10: for (t, url) in (T, Url) do
11: check title-body consistency and return title-body inconsistency errors;
12: end for
13: end for
It should be noted that, by source, the text on financial institution web pages can be divided into news, bulletins, notices, introductions to related service products, service guides, and the like written by professional contributors, and messages, comments, feedback, and the like posted by the public. In general, website text in the financial institution domain has the following characteristics:
1. Proper nouns are numerous. Proper nouns such as institution names, job titles, financial product names, and business names are used frequently and in large numbers on financial institution websites.
2. The text domain is well defined. Most text on financial institution websites describes information related to the domain.
3. The text is mostly narrative. News and notices mainly state the key facts of an event, such as time, place, and people involved, and are concise and clear with little rhetorical embellishment; introductions and guides are mostly objective descriptions with a clear hierarchy of main and subordinate points.
According to the embodiment of the invention, a model-based method first handles complex language phenomena, and a knowledge-graph-based method then corrects knowledge-expression errors, which improves stability. The method corrects the wrongly written characters and knowledge-expression errors that a model-based approach alone cannot handle, and improves text proofreading performance.
According to the technical scheme of this embodiment, the web page text to be checked is obtained; the web page text to be checked is input into a text correction model to obtain a first correction result corresponding to the web page text to be checked, wherein the text correction model is obtained by iteratively training a first model with a first sample set; the first correction result is corrected based on the proper noun dictionary and the financial-domain high-frequency dictionary to obtain a second correction result; and a target correction result is determined according to the knowledge graph and the target triplet corresponding to the second correction result. This can improve both the efficiency of web page text proofreading and the error-correction rate.
Example two
Fig. 5 is a schematic structural diagram of a web page text verification device according to an embodiment of the present invention. This embodiment is applicable to web page text verification. The device may be implemented in software and/or hardware and may be integrated in any equipment that provides a web page text verification function. As shown in fig. 5, the web page text verification device specifically includes: a to-be-verified web page text obtaining module 510, a first correction result determining module 520, a second correction result determining module 530, and a target correction result determining module 540.
The to-be-verified web page text obtaining module is used for obtaining the web page text to be verified;
the first correction result determining module is used for inputting the webpage text to be checked into a text correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text correction model is obtained by iteratively training a first model through a first sample set;
the second correction result determining module is used for correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result;
and the target correction result determining module is used for determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result.
The product can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example three
Fig. 6 shows a schematic diagram of the structure of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 6, the electronic device 10 includes at least one processor 11, and a memory, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, etc., communicatively connected to the at least one processor 11, in which the memory stores a computer program executable by the at least one processor, and the processor 11 may perform various appropriate actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from the storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data required for the operation of the electronic device 10 may also be stored. The processor 11, the ROM 12 and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
Various components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, etc.; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The processor 11 performs the various methods and processes described above, such as the web page text verification method.
In some embodiments, the web page text verification method may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more of the steps of the web page text verification method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the web page text verification method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited in this respect.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A webpage text verification method, comprising:
acquiring a webpage text to be checked;
inputting the webpage text to be checked into a text correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text correction model is obtained by iteratively training a first model through a first sample set;
correcting the error of the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result;
and determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result.
2. The method of claim 1, wherein iteratively training the first model through the first set of samples comprises:
obtaining a first sample set, wherein the first sample set comprises: the error sentence sample and the correct sentence corresponding to the error sentence sample;
inputting the error sentence samples in the first sample set into a first model to obtain a prediction correction result;
and training parameters of the first model according to the prediction correction result and an objective function generated by a correct sentence corresponding to the error sentence sample until a text error correction model is obtained.
3. The method of claim 1, wherein correcting the first correction result based on the proper noun dictionary and the finance domain high-frequency dictionary to obtain a second correction result comprises:
acquiring a proper noun dictionary and a high-frequency dictionary in the financial field;
recognizing proper nouns in the first correction result;
if a proper noun in the first correction result exists in the proper noun dictionary, determining that the proper noun passes verification, and if a proper noun in the first correction result does not exist in the proper noun dictionary, determining that the proper noun is erroneous;
intercepting the first correction result based on a sliding window with a preset length to obtain a candidate set of words to be checked;
sequentially checking the words in the candidate set against the financial-domain high-frequency dictionary: if a word in the candidate set exists in the financial-domain high-frequency dictionary, determining that the word passes verification, and if it does not exist in the financial-domain high-frequency dictionary, determining that the word is erroneous;
acquiring the erroneous proper nouns and high-frequency words in the first correction result;
editing proper nouns and high-frequency words with errors in the first correction result according to a target dictionary to obtain an initial candidate set corresponding to the proper nouns and an initial candidate set corresponding to the high-frequency words;
screening the initial candidate set corresponding to the proper noun according to the proper noun dictionary to obtain a target candidate set corresponding to the proper noun with errors in the first correction result;
screening the initial candidate set corresponding to the high-frequency word according to the high-frequency dictionary in the financial field to obtain a target candidate set corresponding to the high-frequency word with errors in the first correction result;
And determining a second correction result according to the target candidate set corresponding to the proper noun with the error in the first correction result, the target candidate set corresponding to the high-frequency word with the error in the first correction result, the proper noun with the error in the first correction result, the high-frequency word with the error in the first correction result and the first correction result.
4. The method of claim 3, wherein obtaining a financial-domain high-frequency dictionary comprises:
obtaining a dataset, wherein the dataset comprises: web page text in the financial field;
intercepting the webpage text of the financial field based on a sliding window with a preset length to obtain a character string set, wherein the length of each character string in the character string set is the preset length;
acquiring word frequency corresponding to each character string in the character string set;
and adding the character strings with word frequency larger than a word frequency threshold value in the character string set into a high-frequency dictionary in the financial field.
5. A method according to claim 3, wherein obtaining a proper noun dictionary comprises:
formulating crawler rules based on the element selector;
and crawling proper nouns from the web pages in the financial field according to the crawler rules to obtain a proper noun dictionary.
6. The method of claim 1, wherein obtaining the web page text to be verified comprises:
acquiring a financial field webpage to be verified;
and carrying out page analysis on the to-be-verified financial field webpage to obtain to-be-verified webpage text, webpage titles and hyperlinks in the webpage.
7. The method of claim 6, wherein determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result comprises:
creating a knowledge graph, wherein the knowledge graph comprises: at least two financial entities, attribute information of each financial entity and a relationship between the at least two financial entities;
obtaining a target triplet corresponding to the second correction result;
generating a target query statement according to the target triplet and a query statement template for querying the knowledge graph;
executing the target query statement to obtain a return result;
if the returned result is that no error exists, consistency detection is carried out on the webpage title and the text corresponding to the hyperlink in the webpage, and a consistency detection result is obtained;
and generating a target correction result according to the consistency detection result and the second correction result.
8. A web page text verification device, comprising:
the webpage text obtaining module to be verified is used for obtaining the webpage text to be verified;
the first correction result determining module is used for inputting the webpage text to be checked into a text correction model to obtain a first correction result corresponding to the webpage text to be checked, wherein the text correction model is obtained by iteratively training a first model through a first sample set;
the second correction result determining module is used for correcting the first correction result based on the proper noun dictionary and the high-frequency dictionary in the financial field to obtain a second correction result;
and the target correction result determining module is used for determining a target correction result according to the knowledge graph and the target triplet corresponding to the second correction result.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the web page text verification method of any one of claims 1-7.
10. A computer readable storage medium storing computer instructions for causing a processor to perform the web page text verification method of any one of claims 1-7.
CN202311619251.3A 2023-11-29 2023-11-29 Webpage text verification method, device, equipment and storage medium Pending CN117454880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311619251.3A CN117454880A (en) 2023-11-29 2023-11-29 Webpage text verification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117454880A true CN117454880A (en) 2024-01-26

Family

ID=89581912

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311619251.3A Pending CN117454880A (en) 2023-11-29 2023-11-29 Webpage text verification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117454880A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination