CN114417812A - Text checking method, device, equipment and storage medium - Google Patents

Info

Publication number
CN114417812A
Authority
CN
China
Prior art keywords
text
webpage
original
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210249958.9A
Other languages
Chinese (zh)
Inventor
赵鹏飞
游妍
臧康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiping Financial Technology Services Shanghai Co Ltd Shenzhen Branch
Original Assignee
Taiping Financial Technology Services Shanghai Co Ltd Shenzhen Branch
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiping Financial Technology Services Shanghai Co Ltd Shenzhen Branch

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Transfer Between Computers (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a text checking method, device, equipment, and storage medium. The method comprises: acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule, where the webpage text is generated based on the original text and is displayed in the webpage to be tested; preprocessing the webpage text and the original text to generate a target webpage text and a target original text; and checking the target webpage text against the target original text according to a preset checking strategy to generate a text checking result. The technical solution provided by the application can improve the efficiency of text checking.

Description

Text checking method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to a text checking method, device, equipment, and storage medium.
Background
With the rapid development of the mobile internet, enterprises can promote their products more efficiently and conveniently. When an enterprise promotes a product through a webpage, the promotional copy for the product is displayed on a specific webpage. To guarantee the accuracy of the product introduction and improve the customer experience, the promotional copy must not contain errors. Therefore, in the test stage, the promotional copy on the webpage needs to be checked against the stored, correct promotional copy.
Conventionally, because the promotional copy must be kept confidential, this text check is performed manually by internal testers. However, because product promotional copy contains a large amount of content, manual text checking is inefficient.
Disclosure of Invention
In view of this, the embodiments of the application provide a text checking method, device, equipment, and storage medium that can improve the efficiency of text checking.
In a first aspect, a text checking method is provided, and the method includes:
acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested; preprocessing the webpage text and the original text to generate a target webpage text and a target original text; and performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
In one embodiment, the method for acquiring the webpage text from the webpage to be tested according to the first preset acquisition rule includes any one of the following modes:
calling a preset interface to obtain a webpage text from a webpage to be tested; acquiring a screenshot page of a webpage to be tested, and identifying the screenshot page by adopting an OCR technology to generate a webpage text; and acquiring a webpage text from the webpage to be tested by adopting a crawler technology.
In one embodiment, invoking a preset interface to obtain a webpage text from a webpage to be tested includes:
calling a preset interface to inject Javascript codes into the webpage to be tested; acquiring a character string corresponding to a webpage to be tested through the Javascript code; the character string comprises first text content and first format characters; and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text.
In one embodiment, the obtaining the original text according to the second preset obtaining rule includes any one of the following manners:
acquiring an original text according to a preset absolute path; the absolute path comprises path information starting from a root directory of the original text to the original file identifier; acquiring an original text according to a preset relative path; the relative path comprises an original file identifier; and acquiring the original text from the file storage server in a preset acquisition mode, wherein the preset acquisition mode comprises any one of calling a preset interface and calling a preset file scheduling component.
In one embodiment, the preprocessing the web page text and the original text to generate a target web page text and a target original text includes:
denoising the webpage text and the original text respectively according to a preset regular expression rule base to obtain an intermediate webpage text and an intermediate original text; respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
In one embodiment, the method further includes:
cutting the intermediate original text according to the intermediate webpage text to generate an intermediate original text matched with the intermediate webpage text; and generating a target original text based on the intermediate original text matched with the intermediate webpage text.
In one embodiment, performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result includes:
acquiring second text sub-content from the target original text; for each second text sub-content, performing text checking on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result; the text checking result comprises a checking result and a second text sub-content corresponding to the checking result.
In one embodiment, the method further includes:
if the checking result indicates that an error exists, acquiring the second text sub-content corresponding to the checking result; and marking and displaying the second text sub-content.
In a second aspect, there is provided a text collating apparatus comprising:
the acquisition module is used for acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested;
the preprocessing module is used for preprocessing the webpage text and the original text to generate a target webpage text and a target original text;
and the checking module is used for performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
In a third aspect, a computer device is provided, comprising a memory and a processor, the memory storing a computer program, the computer program, when executed by the processor, implementing the method steps in any of the embodiments of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method steps of any of the embodiments of the first aspect described above.
According to the text checking method, device, equipment, and storage medium, the webpage text is obtained from the webpage to be tested according to the first preset acquisition rule, and the original text is obtained according to the second preset acquisition rule; the webpage text and the original text are preprocessed to generate a target webpage text and a target original text; and the target webpage text is checked against the target original text according to a preset checking strategy to generate a text checking result. In the technical solution provided by the embodiments of the application, an independently developed text checking tool automatically compares the acquired webpage text with the original text, so no manual text checking is required, and text checking efficiency improves while the privacy and security of the text content are preserved.
Drawings
FIG. 1 is a block diagram of a computer device provided by an embodiment of the present application;
FIG. 2 is a flowchart of a text checking method provided by an embodiment of the present application;
FIG. 3 is a flowchart of acquiring a webpage text provided by an embodiment of the present application;
FIG. 4 is a flowchart of preprocessing a webpage text and an original text provided by an embodiment of the present application;
FIG. 5 is a flowchart of generating a target original text provided by an embodiment of the present application;
FIG. 6 is a flowchart of generating a text checking result provided by an embodiment of the present application;
FIG. 7 is a flowchart of displaying a text checking result provided by an embodiment of the present application;
FIG. 8 is an overall framework diagram of text checking provided by an embodiment of the present application;
FIG. 9 is a schematic view of a text checking operation interface provided by an embodiment of the present application;
FIG. 10 is a flowchart of a text checking method provided by yet another embodiment of the present application;
FIG. 11 is a block diagram of a text checking device provided by an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The text checking method provided by the application can be applied to a computer device. The computer device can be a server or a terminal, and the server can be a single server or a cluster of multiple servers.
Taking the example of a computer device being a server, FIG. 1 shows a block diagram of a server, which may include a processor, memory, and network interface connected by a system bus, as shown in FIG. 1. Wherein the processor of the server is configured to provide computing and control capabilities. The memory of the server comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The computer program is executed by a processor to implement a text collation method.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the servers to which the subject application applies, and that servers may alternatively include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
The execution subject of the embodiments of the present application may be a computer device or a text checking device. The following method embodiments are described with a computer device as the execution subject.
In one embodiment, FIG. 2 shows a flowchart of a text checking method provided by an embodiment of the present application. The method may include the following steps:
Step 220, acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested.
The webpage to be tested is the webpage that displays the webpage text, and the webpage text is promotional copy introducing the products on sale. The webpage text is generated from the original text, which is the original promotional document for those products; the webpage text may be the whole of the original text or only part of it. When text checking is carried out, the webpage text is acquired from the webpage to be tested according to the first preset acquisition rule. Under this rule, the webpage text may be obtained by directly parsing the webpage elements at the address of the webpage to be tested, or by driving a browser to open the address of the webpage to be tested and then parsing the webpage elements of the loaded page; the webpage text may also be obtained in other ways, which this embodiment does not specifically limit. The original text is acquired according to the second preset acquisition rule, for example from the storage path of the original text or in other ways, which this embodiment likewise does not specifically limit.
Step 240, preprocessing the webpage text and the original text to generate a target webpage text and a target original text.
After the webpage text and the original text are obtained, they can be checked against each other directly to obtain a text checking result. Alternatively, the webpage text and the original text can first be preprocessed to generate a target webpage text and a target original text, which are then checked against each other to obtain the text checking result. The preprocessing may include converting the text format, cutting the text to the interval to be checked, and other preprocessing operations, which this embodiment does not specifically limit.
Step 260, performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
The text checking result may be a check success or a check failure. For a check failure, the result may also indicate the failure type, such as missing text or incorrect characters in the text, or another type of failure. The checking result may be output as text, as a mark, or in another form. When output as text, for example, the specific text content that failed the check can be displayed. When output as a mark, for example, the text content that failed or passed the check can be highlighted: content that passed can be marked green and content that failed can be marked red. Other mark forms such as underlining and bolding are also possible, which this embodiment does not limit.
In the embodiment, a webpage text is acquired from a webpage to be tested according to a first preset acquisition rule, and an original text is acquired according to a second preset acquisition rule; preprocessing the webpage text and the original text to generate a target webpage text and a target original text; and performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result. The obtained webpage text and the original text can be automatically compared through the independently developed text checking tool, manual text checking is not needed, and the text checking efficiency is improved on the premise that the privacy and the safety of text contents are guaranteed.
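Steps 220 to 260 can be pictured as a small pipeline. The following Python sketch is illustrative only: the function names `preprocess`, `check`, and `run_check` are hypothetical, acquisition is replaced by strings that are assumed to be already loaded, and the checking strategy is assumed to be an exact lookup of each original paragraph among the webpage paragraphs.

```python
from typing import List, Tuple


def preprocess(text: str) -> List[str]:
    """Step 240 (simplified): split into paragraphs and drop blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def check(target_web: List[str], target_original: List[str]) -> List[Tuple[str, bool]]:
    """Step 260 (simplified): an original paragraph passes if it appears on the page."""
    web_set = set(target_web)
    return [(paragraph, paragraph in web_set) for paragraph in target_original]


def run_check(web_text: str, original_text: str) -> List[Tuple[str, bool]]:
    """Steps 220-260, with acquisition stubbed out as already-loaded strings."""
    return check(preprocess(web_text), preprocess(original_text))


results = run_check(
    web_text="Plan A covers travel.\n\nPremium: 100 per year.",
    original_text="Plan A covers travel.\nPremium: 120 per year.",
)
```

Each entry of `results` pairs an original paragraph with a pass or fail flag, which mirrors the requirement that the text checking result carry both the outcome and the text it refers to.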
In one embodiment, the method for acquiring the webpage text from the webpage to be tested according to the first preset acquisition rule includes any one of the following modes: calling a preset interface to obtain a webpage text from a webpage to be tested; acquiring a screenshot page of a webpage to be tested, and identifying the screenshot page by adopting an OCR technology to generate a webpage text; and acquiring a webpage text from the webpage to be tested by adopting a crawler technology. The webpage text can be obtained through various modes, and the applicability and flexibility of text checking are improved.
In one embodiment, FIG. 3 shows a flowchart of a text checking method provided in an embodiment of the present application, specifically a possible process of acquiring the webpage text. The method may include the following steps:
Step 320, calling a preset interface to inject JavaScript code into the webpage to be tested.
Step 340, acquiring a character string corresponding to the webpage to be tested through the JavaScript code; the character string includes first text content and first format characters.
Step 360, parsing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text.
When the webpage text of the page to be tested is obtained, a web testing framework can call WebDriver to open the browser that corresponds to the browser driver file; a different browser can be opened by replacing the browser driver file. The webpage content is loaded by accessing the specified webpage address in the opened browser. The specified webpage address is the address from which the page to be tested is loaded; it can be set by default, entered manually, or added in another way.
Furthermore, a preset API (application programming interface) of the Selenium framework is used to inject pre-written JavaScript code into the webpage to be tested. The JavaScript code parses the webpage to be tested and returns the character string corresponding to it, which includes first text content and first format characters. The first text content is the actual text content. The first format characters are the format characters within the first text content; they can be segmentation characters such as "\r\n" and "\t" (for example, line-break characters), or other format characters. The first text content and the first format characters in the character string are then parsed to generate the webpage text.
In the embodiment, a preset interface is called to inject Javascript codes into the webpage to be tested; acquiring a character string corresponding to the webpage to be tested through the Javascript code; and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text. The webpage to be tested is analyzed through the pre-written Javascript code to generate a corresponding webpage text, and the obtaining mode is simple and efficient.
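A minimal Python sketch of steps 320 to 360 follows. It assumes the driver object exposes `get` and `execute_script`, as a Selenium WebDriver does; the injected script, the URL, and the `FakeDriver` stand-in are hypothetical and exist only so the sketch runs without a browser. Which format characters count as separators is also an assumption.

```python
from typing import Any, List

# Hypothetical injected script: returns the visible text of the page as one string.
EXTRACT_SCRIPT = "return document.body.innerText;"


def fetch_page_string(driver: Any, url: str) -> str:
    """Steps 320-340: load the page, run the injected script, return its string.

    `driver` is assumed to offer get(url) and execute_script(script),
    as a Selenium WebDriver does.
    """
    driver.get(url)
    return driver.execute_script(EXTRACT_SCRIPT)


def parse_page_string(raw: str) -> List[str]:
    """Step 360: treat CRLF, LF and tab as format characters and split on them."""
    normalized = raw.replace("\r\n", "\n").replace("\t", "\n")
    return [part.strip() for part in normalized.split("\n") if part.strip()]


class FakeDriver:
    """Stand-in for a browser so the sketch runs without Selenium installed."""

    def get(self, url: str) -> None:
        self.url = url

    def execute_script(self, script: str) -> str:
        return "Product overview\r\nCovers accidents\tand illness\r\n\r\nTerms apply"


web_paragraphs = parse_page_string(
    fetch_page_string(FakeDriver(), "https://example.test/page")
)
```

With a real browser, a Selenium WebDriver instance would be passed in place of `FakeDriver()`.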
In one embodiment, acquiring the original text according to the second preset acquisition rule includes any one of the following manners. First, the original text can be acquired according to a preset absolute path; the absolute path includes path information starting from the root directory of the original text to the original file identifier. The original text at the absolute path may be a file in Word, txt, PDF, Excel, or a similar format, and different text reading components may be used to read it: for example, the character string of an original text in Word format may be read with a Spire component, an original text in Excel format with WorkbookFactory, and an original text in PDF format with PDFParser. Second, the original text can be acquired according to a preset relative path; the relative path includes the original file identifier. The original text can be stored locally, for example after being sent by mail or transmitted in another form and saved to a local disk, so that it can be acquired through an absolute or relative path. Third, the original text can be acquired from a file storage server in a preset acquisition mode, which includes either calling a preset interface or calling a preset file scheduling component; for example, an original text stored in the cloud or on a server can be acquired through file scheduling components such as ETL and MQ. Because the original text can be obtained in several ways, the applicability and flexibility of text checking are improved.
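The path-based manners can be sketched in Python as follows. The sketch handles only a plain-text file; reading Word, Excel, or PDF originals would need the corresponding reader components. The function name `load_original_text` and the `base_dir` parameter are hypothetical.

```python
import tempfile
from pathlib import Path


def load_original_text(path_spec: str, base_dir: Path) -> str:
    """Read a plain-text original from an absolute or a relative path.

    A relative path is resolved against `base_dir`, a hypothetical
    working directory chosen for this sketch.
    """
    path = Path(path_spec)
    if not path.is_absolute():
        path = base_dir / path
    return path.read_text(encoding="utf-8")


# Demonstration with a temporary file standing in for the stored original.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "copy.txt").write_text("Plan A covers travel.", encoding="utf-8")
    via_relative = load_original_text("copy.txt", base)
    via_absolute = load_original_text(str(base / "copy.txt"), base)
```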
In one embodiment, FIG. 4 shows a flowchart of a text checking method provided in an embodiment of the present application, specifically a possible process of preprocessing the webpage text and the original text. The method may include the following steps:
Step 420, denoising the webpage text and the original text respectively according to a preset regular expression rule base to obtain an intermediate webpage text and an intermediate original text.
Step 440, segmenting the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
The preset regular expression rule base can store, in advance, patterns for the code, special characters, pictures, and other content to be deleted from the webpage text and the original text. After denoising, the intermediate webpage text and the intermediate original text are each segmented according to the preset segmentation characters to generate the target webpage text and the target original text. The target webpage text includes at least one first text sub-content corresponding to the intermediate webpage text, and the target original text includes at least one second text sub-content corresponding to the intermediate original text. Specifically, each time a segmentation character is found in the character string, a line break is inserted to form a paragraph, which yields the first text sub-contents and the second text sub-contents. Finally, once the first text sub-contents or second text sub-contents have been obtained, blank paragraphs can be deleted before the corresponding target webpage text and target original text are generated. The generated target webpage text and target original text can be kept in memory and displayed; the memory is released and rewritten when the target texts are acquired again or the text checking operation exits.
In the embodiment, the intermediate webpage text and the intermediate original text are obtained by respectively carrying out denoising processing on the webpage text and the original text according to a preset regular expression rule base; and respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text. By carrying out segmentation processing on the intermediate webpage text and the intermediate original text, the subsequent text segmentation checking is convenient, and the text checking efficiency is improved.
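Steps 420 and 440 can be sketched in Python as follows. The contents of the rule base and the choice of segmentation characters are assumptions made for illustration; the application does not list them.

```python
import re
from typing import List

# Hypothetical rule base: each pattern matches content to delete.
NOISE_RULES = [
    re.compile(r"<[^>]+>"),        # leftover markup tags
    re.compile("[\u200b\ufeff]"),  # zero-width space and byte-order mark
]

# Hypothetical segmentation characters: line breaks of either style.
SEGMENT_PATTERN = re.compile(r"\r\n|\n")


def denoise(text: str) -> str:
    """Step 420: apply every rule in the rule base in turn."""
    for rule in NOISE_RULES:
        text = rule.sub("", text)
    return text


def segment(text: str) -> List[str]:
    """Step 440: split on segmentation characters and drop blank paragraphs."""
    parts = SEGMENT_PATTERN.split(text)
    return [p.strip() for p in parts if p.strip()]


raw = "<p>Plan A covers travel.</p>  \r\n\r\n\u200bPremium: 100 per year.\n"
target = segment(denoise(raw))
```

Applying the same two functions to both the webpage text and the original text keeps the paragraphs comparable in the later checking step.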
In one embodiment, FIG. 5 shows a flowchart of a text checking method provided in an embodiment of the present application, specifically a possible process of generating the target original text. The method may include the following steps:
Step 520, cutting the intermediate original text according to the intermediate webpage text to generate an intermediate original text matched with the intermediate webpage text.
Step 540, generating a target original text based on the intermediate original text matched with the intermediate webpage text.
To facilitate text checking, the intermediate original text can be cut according to the intermediate webpage text to generate an intermediate original text that matches the intermediate webpage text. After cutting, the first paragraph of the intermediate original text is the same as the first paragraph of the intermediate webpage text, and the last paragraph of the intermediate original text is the same as the last paragraph of the intermediate webpage text. The target original text is then generated from the intermediate original text that matches the intermediate webpage text.
In the embodiment, the intermediate original text is cut according to the intermediate webpage text to generate the intermediate original text matched with the intermediate webpage text; and generating a target original text based on the intermediate original text matched with the intermediate webpage text. By cutting the intermediate original text, the subsequent text segmentation check is facilitated, and the text check efficiency is improved.
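The cutting step can be sketched in Python as follows, assuming both texts are already available as lists of paragraphs and that the boundary paragraphs match exactly. The function name `cut_original` is hypothetical, and returning an empty list when a boundary is missing is a choice made for this sketch.

```python
from typing import List


def cut_original(original: List[str], web: List[str]) -> List[str]:
    """Keep the span of `original` from the page's first paragraph to its last.

    Returns an empty list when either boundary paragraph cannot be found.
    """
    if not web:
        return []
    try:
        start = original.index(web[0])
        # Search for the closing paragraph at or after the opening one.
        end = original.index(web[-1], start)
    except ValueError:
        return []
    return original[start:end + 1]


original = [
    "Internal note",
    "Plan A covers travel.",
    "Premium: 100 per year.",
    "Terms apply.",
    "Appendix",
]
web = ["Plan A covers travel.", "Premium: 100 per year.", "Terms apply."]
cut = cut_original(original, web)
```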
In one embodiment, FIG. 6 shows a flowchart of a text checking method provided in an embodiment of the present application, specifically a possible process of generating the text checking result. The method may include the following steps:
Step 620, acquiring the second text sub-content from the target original text.
Step 640, for each second text sub-content, performing text check on the second text sub-content and the first text sub-content in the target webpage text to generate a text check result; the text checking result comprises a checking result and a second text sub-content corresponding to the checking result.
Specifically, the second text sub-contents are obtained from the target original text, and each second text sub-content is checked against the first text sub-contents in the target webpage text to obtain a text checking result. The target original text includes at least one second text sub-content, and the text checking result can be obtained by searching for and comparing each second text sub-content in the target webpage text paragraph by paragraph. The second text sub-content can then be compared character by character with the corresponding first text sub-content in the target webpage text to locate the specific erroneous content. Text checking may also be performed in batches according to server performance, or in other ways, which this embodiment does not specifically limit.
In the embodiment, the second text sub-content is obtained from the target original text; and for each second text sub-content, text checking is performed on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result. Comparing the second text sub-content with the first text sub-content in the target webpage text paragraph by paragraph improves the accuracy and efficiency of text checking.
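A Python sketch of paragraph-by-paragraph checking with character-level error location follows. It uses the standard `difflib` module as a stand-in for the comparison logic, which the application does not specify. Pairing a failed paragraph with its most similar webpage paragraph, and the 0.6 similarity cutoff, are assumptions.

```python
import difflib
from typing import Dict, List


def locate_difference(expected: str, actual: str) -> List[str]:
    """Describe character-level differences between two paragraphs."""
    matcher = difflib.SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        f"{tag}: {expected[i1:i2]!r} -> {actual[j1:j2]!r}"
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def check_paragraphs(original: List[str], web: List[str]) -> List[Dict[str, object]]:
    """Check each original paragraph against the webpage paragraphs."""
    results = []
    for paragraph in original:
        if paragraph in web:
            results.append({"paragraph": paragraph, "ok": True, "details": []})
            continue
        # Assumption: compare against the most similar page paragraph, if any.
        close = difflib.get_close_matches(paragraph, web, n=1, cutoff=0.6)
        details = locate_difference(paragraph, close[0]) if close else ["missing from page"]
        results.append({"paragraph": paragraph, "ok": False, "details": details})
    return results


checked = check_paragraphs(
    original=["Plan A covers travel.", "Premium: 120 per year.", "Terms apply."],
    web=["Plan A covers travel.", "Premium: 100 per year."],
)
```

In this example the second paragraph fails with a single-character replacement and the third is reported as missing, which corresponds to the two failure types named in the description (character errors and missing text).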
In one embodiment, FIG. 7 shows a flowchart of a text checking method provided in an embodiment of the present application, specifically a possible process of displaying the text checking result. The method may include the following steps:
Step 720, if the checking result indicates that an error exists, acquiring the second text sub-content corresponding to the checking result.
Step 740, marking and displaying the second text sub-content.
If the check result is error, acquiring a second text sub-content corresponding to the check result, and highlighting the second text word content; and comparing the second text sub-content with the first text sub-content in the corresponding target webpage text character by character, positioning the specific error content, and displaying the error content in a display box.
In this embodiment, if the checking result is that an error exists, acquiring a second text sub-content corresponding to the checking result; the second text sub-content is marked and displayed, so that the text content with errors can be visually displayed for the user, and the intelligence of the text checking tool is improved.
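One way to locate and mark the erroneous characters is sketched below in Python using the standard `difflib` module. This is an illustration under assumptions rather than the tool's actual implementation: the application does not name a diff algorithm, the bracket markers stand in for on-screen highlighting, and `closest_segment` and `mark_differences` are hypothetical helper names.

```python
import difflib


def closest_segment(original_seg, webpage_segments):
    """Pick the webpage segment most similar to the original segment."""
    return max(
        webpage_segments,
        key=lambda w: difflib.SequenceMatcher(None, original_seg, w).ratio(),
    )


def mark_differences(original_seg, webpage_seg, open_mark="[", close_mark="]"):
    """Return the original segment with the differing characters bracketed."""
    matcher = difflib.SequenceMatcher(None, original_seg, webpage_seg)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.append(original_seg[i1:i2])
        elif tag == "insert":
            # Characters present only in the webpage text; flag the position.
            out.append(open_mark + "+" + webpage_seg[j1:j2] + close_mark)
        else:  # "replace" or "delete"
            out.append(open_mark + original_seg[i1:i2] + close_mark)
    return "".join(out)


page = ["Policy term: 10 years", "Premium is paid monthly"]
original = "Premium is paid annually"
print(mark_differences(original, closest_segment(original, page)))
```

In a graphical tool the bracketed spans would be replaced by highlighting in the display box; the bracket form simply keeps the sketch runnable in a terminal.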
In an embodiment, fig. 8 shows an overall framework diagram for text checking provided in an embodiment of the present application. When text checking is performed, the webpage address of the page to be tested and a Word file are obtained according to the text checking requirement provided by a tester, and the text content of the original text in the Word file can be cut according to the webpage text to obtain the interval required for text checking. When the webpage text is loaded, the page to be tested corresponding to the webpage address can be loaded through WebDriver-driven parsing or HTML parsing; when the original text in the Word file is loaded, the Word file can be parsed through a text reading component spiral. The HTML content and the Word content can then be compared, and the comparison result can be highlighted so that the person performing the check can review it. Specifically, fig. 9 is a schematic view of a text checking operation interface provided in an embodiment of the present application, and fig. 10 shows a flowchart of a text checking method based on that operation interface; the method may include the following steps:
Step 1001, acquiring a webpage address of a page to be tested according to the text checking requirement of a tester.
Step 1002, loading the page to be tested corresponding to the webpage address by triggering a WebDriver-driven parsing button or an HTML parsing button.
Step 1003, acquiring the original text according to a preset storage path of the original text by triggering a text loading button.
Step 1004, injecting pre-written Javascript code into the webpage to be tested by triggering a webpage text acquiring button, so as to acquire a character string corresponding to the webpage to be tested.
Step 1005, segmenting the first text content in the character string according to the segmentation characters to generate at least one first text sub-content.
Step 1006, generating a target webpage text based on the at least one first text sub-content, and displaying the target webpage text.
Step 1007, inputting line numbers in the text interval button, and cutting the second text content in the target original text according to the target webpage text through the cutting button to generate second text content matched with the target webpage text.
Step 1008, by triggering the start comparison button or the line comparison button, comparing the second text sub-content with the first text sub-content in the target webpage text section by section or character by character to obtain a text comparison result.
Step 1009, outputting the text checking result.
The implementation principle and technical effect of each step in the text verification method provided by this embodiment are similar to those in the foregoing text verification method embodiments, and are not described herein again. The implementation manner of each step in the embodiment of fig. 10 is only an example, and is not limited to this, and the order of each step may be adjusted in practical application as long as the purpose of each step can be achieved.
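As an illustration of the HTML parsing route and of steps 1005 and 1006, the following Python sketch extracts the visible text from an HTML string with the standard `html.parser` module and numbers the resulting segments for display. Treating each text node as one segment is an assumption made for brevity, and `TextExtractor` and `webpage_segments` are hypothetical names.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def webpage_segments(html):
    """Return the visible text nodes of an HTML document as segments."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts


html = (
    "<html><body><h1>Terms</h1><p>Coverage starts on day 1.</p>"
    "<script>var x = 1;</script></body></html>"
)
segments = webpage_segments(html)
# Number the segments so that a line interval can later be chosen for cutting.
numbered = list(enumerate(segments, start=1))
for number, text in numbered:
    print(number, text)
```

Numbering the segments at display time is what makes the line-interval input of step 1007 possible.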
In the technical solution provided by the embodiments of the present application, the acquired webpage text and the original text can be compared automatically by an independently developed text checking tool, so that manual text checking is not required, and text checking efficiency is improved while the privacy and security of the text content are preserved.
It should be understood that although the steps in the flowcharts of figs. 2-10 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-10 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Referring to fig. 11, a block diagram of a text checking apparatus 1100 according to an embodiment of the present application is shown. As shown in fig. 11, the text checking apparatus 1100 may include: an acquisition module 1102, a preprocessing module 1104, and a checking module 1106, wherein:
the acquiring module 1102 is configured to acquire a webpage text from a webpage to be tested according to a first preset acquiring rule, and acquire an original text according to a second preset acquiring rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested;
the preprocessing module 1104 is used for preprocessing the webpage text and the original text to generate a target webpage text and a target original text;
and a checking module 1106, configured to perform text checking on the target webpage text and the target original text according to a preset checking policy, so as to generate a text checking result.
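The division into three modules can be pictured as three callables wired together. The Python sketch below is a hypothetical illustration of that structure only: the class name, the field names and the trivial stand-in functions are assumptions and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TextCheckingApparatus:
    # Acquisition module: returns (webpage_text, original_text).
    acquire: Callable[[], Tuple[str, str]]
    # Preprocessing module: returns (target_webpage_text, target_original_text).
    preprocess: Callable[[str, str], Tuple[List[str], List[str]]]
    # Checking module: returns (segment, passed) pairs.
    check: Callable[[List[str], List[str]], List[Tuple[str, bool]]]

    def run(self):
        webpage_text, original_text = self.acquire()
        target_web, target_orig = self.preprocess(webpage_text, original_text)
        return self.check(target_web, target_orig)


# Trivial stand-ins, used only to show how the modules connect.
apparatus = TextCheckingApparatus(
    acquire=lambda: ("A\nB", "A\nC"),
    preprocess=lambda web, orig: (web.split("\n"), orig.split("\n")),
    check=lambda web, orig: [(seg, seg in web) for seg in orig],
)
result = apparatus.run()
print(result)
```

Keeping the modules as independent callables reflects the statement that each module may be implemented wholly or partially in software, hardware, or a combination thereof.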
In an embodiment, the obtaining module 1102 is specifically configured to obtain the webpage text in any one of the following manners: calling a preset interface to obtain the webpage text from the webpage to be tested; acquiring a screenshot page of the webpage to be tested and recognizing the screenshot page by OCR to generate the webpage text; or acquiring the webpage text from the webpage to be tested by a crawler technology.
In an embodiment, the obtaining module 1102 is further configured to call a preset interface to inject a Javascript code into the webpage to be tested; acquiring a character string corresponding to a webpage to be tested through the Javascript code; the character string comprises first text content and first format characters; and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text.
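A minimal Python sketch of the script-injection route follows. It assumes a Selenium-style driver object that exposes `execute_script`, assumes that the injected script simply returns `document.body.innerText`, and assumes that the format characters are line breaks and tabs; the application does not disclose the actual script or format characters. `FakeDriver` is a stand-in so that the sketch runs without a browser.

```python
# Assumed script; the pre-written Javascript code of the application is not disclosed.
JS_GET_TEXT = "return document.body.innerText;"


def fetch_page_string(driver):
    """Inject the script and return the character string it yields.

    `driver` is any object with a Selenium-style execute_script(script) method.
    """
    return driver.execute_script(JS_GET_TEXT)


def parse_page_string(raw):
    """Separate the text content from the format characters (newlines, tabs)."""
    lines = (line.replace("\t", " ").strip() for line in raw.splitlines())
    return [line for line in lines if line]


class FakeDriver:
    """Stand-in for a real browser driver, for demonstration only."""

    def execute_script(self, script):
        return "Title\n\n\tFirst paragraph.\r\nSecond paragraph.\n"


webpage_text = parse_page_string(fetch_page_string(FakeDriver()))
print(webpage_text)
```

With a real browser session, the `FakeDriver()` instance would be replaced by an actual WebDriver object that has already loaded the webpage to be tested.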
In an embodiment, the obtaining module 1102 is further configured to obtain the original text in any one of the following manners: acquiring the original text according to a preset absolute path, where the absolute path comprises path information starting from the root directory of the original text to the original file identifier; acquiring the original text according to a preset relative path, where the relative path comprises the original file identifier; or acquiring the original text from a file storage server in a preset acquisition mode, where the preset acquisition mode comprises either calling a preset interface or calling a preset file scheduling component.
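The absolute-path and relative-path options can be sketched in Python with `pathlib`. The sketch assumes a plain-text file as a stand-in for the original document (a Word file would need a document-reading component instead of `read_text`), and the function names and the `base_dir` parameter are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_original(path_str, base_dir="."):
    """Use absolute paths as given; resolve relative ones against base_dir."""
    path = Path(path_str)
    return path if path.is_absolute() else Path(base_dir) / path


def load_original(path_str, base_dir="."):
    """Read the original text from the resolved location."""
    return resolve_original(path_str, base_dir).read_text(encoding="utf-8")


# Demonstration with a temporary directory standing in for the storage location.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "original.txt"
    sample.write_text("Clause 1\nClause 2\n", encoding="utf-8")
    via_absolute = load_original(str(sample))
    via_relative = load_original("original.txt", base_dir=tmp)

print(via_absolute == via_relative)
```

The file storage server option is not covered here, since the application does not describe the interface or the file scheduling component involved.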
In an embodiment, the preprocessing module 1104 is specifically configured to perform denoising processing on the web page text and the original text according to a preset regular expression rule base, so as to obtain an intermediate web page text and an intermediate original text; respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
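The denoising and segmentation performed by the preprocessing module might look like the following Python sketch. The three rules and the use of the newline as the segmentation character are assumptions chosen for illustration; the actual regular expression rule base and segmentation characters are not listed in this part of the application.

```python
import re

# Hypothetical rule base: (pattern, replacement) pairs applied in order.
RULES = [
    (re.compile(r"<[^>]+>"), ""),        # leftover markup tags
    (re.compile(r"[ \t\u3000]+"), " "),  # runs of spaces, tabs, full-width spaces
    (re.compile(r" ?\n ?"), "\n"),       # spaces around line breaks
]
SEGMENT_SPLIT = re.compile(r"\n+")  # assumed segmentation character: newline


def denoise(text):
    """Apply every rule in the rule base to obtain the intermediate text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def segment(text):
    """Split the intermediate text into non-empty text sub-contents."""
    return [part for part in SEGMENT_SPLIT.split(text) if part]


def preprocess(text):
    return segment(denoise(text))


print(preprocess("<p>Clause  1</p>\n\n <b>Clause\t2</b> \n"))
```

Applying the same `preprocess` function to both the webpage text and the original text keeps the two sides comparable, since any normalization then affects both in the same way.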
In an embodiment, the preprocessing module 1104 is further specifically configured to cut the intermediate original text according to the intermediate web page text, and generate an intermediate original text matched with the intermediate web page text; and generating a target original text based on the intermediate original text matched with the intermediate webpage text.
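Cutting the original text down to the part that the webpage actually displays can be sketched in two ways in Python: by an explicit line interval, or by the first and last original segments that also occur in the webpage text. The second strategy is an assumption about how the matching range could be found automatically; both function names are hypothetical.

```python
def cut_by_lines(original_segments, start, end):
    """Keep segments whose 1-based line numbers fall in [start, end]."""
    return original_segments[start - 1:end]


def cut_by_webpage(original_segments, webpage_segments):
    """Keep the span bounded by the first and last segments found in the webpage."""
    page = set(webpage_segments)
    hits = [i for i, seg in enumerate(original_segments) if seg in page]
    if not hits:
        return []
    return original_segments[hits[0]:hits[-1] + 1]


original = ["Cover page", "Clause 1", "Clause 2 (old)", "Clause 3", "Appendix"]
webpage = ["Clause 1", "Clause 2", "Clause 3"]
print(cut_by_lines(original, 2, 4))
print(cut_by_webpage(original, webpage))
```

Both calls keep "Clause 2 (old)" inside the cut range, so the later check can still report it as a mismatch instead of silently dropping it.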
In one embodiment, the checking module 1106 is specifically configured to obtain a second text sub-content from the target original text; for each second text sub-content, performing text checking on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result; the text checking result comprises a checking result and a second text sub-content corresponding to the checking result.
In an embodiment, the checking module 1106 is further configured to, if the checking result is that there is an error, obtain a second text sub-content corresponding to the checking result; and marking and displaying the second text sub-content.
For the specific definition of the text checking apparatus, reference may be made to the above definition of the text checking method, which is not described herein again. The modules in the text checking apparatus may be wholly or partially implemented by software, hardware, or a combination thereof. The modules can be embedded in, or be independent of, a processor in the computer device in hardware form, or can be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment of the present application, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:
acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested; preprocessing the webpage text and the original text to generate a target webpage text and a target original text; and performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
acquiring the webpage text in any one of the following manners: calling a preset interface to obtain the webpage text from the webpage to be tested; acquiring a screenshot page of the webpage to be tested and recognizing the screenshot page by OCR to generate the webpage text; or acquiring the webpage text from the webpage to be tested by a crawler technology.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
calling a preset interface to inject Javascript codes into the webpage to be tested; acquiring a character string corresponding to a webpage to be tested through the Javascript code; the character string comprises first text content and first format characters; and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
acquiring the original text in any one of the following manners: acquiring the original text according to a preset absolute path, where the absolute path comprises path information starting from the root directory of the original text to the original file identifier; acquiring the original text according to a preset relative path, where the relative path comprises the original file identifier; or acquiring the original text from a file storage server in a preset acquisition mode, where the preset acquisition mode comprises either calling a preset interface or calling a preset file scheduling component.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
denoising the webpage text and the original text respectively according to a preset regular expression rule base to obtain an intermediate webpage text and an intermediate original text; respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
cutting the intermediate original text according to the intermediate webpage text to generate an intermediate original text matched with the intermediate webpage text; and generating a target original text based on the intermediate original text matched with the intermediate webpage text.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
acquiring second text sub-content from the target original text; for each second text sub-content, performing text checking on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result; the text checking result comprises a checking result and a second text sub-content corresponding to the checking result.
In one embodiment of the application, the processor when executing the computer program further performs the steps of:
if the checking result is that an error exists, acquiring second text subcontent corresponding to the checking result; and marking and displaying the second text sub-content.
The implementation principle and technical effect of the computer device provided by the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.
In an embodiment of the application, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of:
acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested; preprocessing the webpage text and the original text to generate a target webpage text and a target original text; and performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
acquiring the webpage text in any one of the following manners: calling a preset interface to obtain the webpage text from the webpage to be tested; acquiring a screenshot page of the webpage to be tested and recognizing the screenshot page by OCR to generate the webpage text; or acquiring the webpage text from the webpage to be tested by a crawler technology.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
calling a preset interface to inject Javascript codes into the webpage to be tested; acquiring a character string corresponding to a webpage to be tested through the Javascript code; the character string comprises first text content and first format characters; and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate a webpage text.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
acquiring the original text in any one of the following manners: acquiring the original text according to a preset absolute path, where the absolute path comprises path information starting from the root directory of the original text to the original file identifier; acquiring the original text according to a preset relative path, where the relative path comprises the original file identifier; or acquiring the original text from a file storage server in a preset acquisition mode, where the preset acquisition mode comprises either calling a preset interface or calling a preset file scheduling component.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
denoising the webpage text and the original text respectively according to a preset regular expression rule base to obtain an intermediate webpage text and an intermediate original text; respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate a target webpage text and a target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
cutting the intermediate original text according to the intermediate webpage text to generate an intermediate original text matched with the intermediate webpage text; and generating a target original text based on the intermediate original text matched with the intermediate webpage text.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
acquiring second text sub-content from the target original text; for each second text sub-content, performing text checking on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result; the text checking result comprises a checking result and a second text sub-content corresponding to the checking result.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of:
if the checking result is that an error exists, acquiring second text subcontent corresponding to the checking result; and marking and displaying the second text sub-content.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of the present specification.
The above examples express only several embodiments of the present application, and their description is relatively specific and detailed, but they are not to be construed as limiting the scope of the claims. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (11)

1. A text checking method, characterized in that the method comprises:
acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule, and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested;
preprocessing the webpage text and the original text to generate a target webpage text and a target original text;
and performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
2. The method of claim 1, wherein the obtaining the webpage text from the webpage to be tested according to the first preset obtaining rule comprises any one of the following manners:
calling a preset interface to obtain the webpage text from the webpage to be tested;
acquiring a screenshot page of the webpage to be tested, and identifying the screenshot page by adopting an OCR technology to generate a webpage text;
and acquiring the webpage text from the webpage to be tested by adopting a crawler technology.
3. The method of claim 2, wherein the calling a preset interface to obtain the webpage text from the webpage to be tested comprises:
calling a preset interface to inject Javascript codes into the webpage to be tested;
acquiring a character string corresponding to the webpage to be tested through the Javascript code; the character string comprises first text content and first format characters;
and analyzing the first text content and the first format characters in the character string of the webpage to be tested to generate the webpage text.
4. The method according to claim 1, wherein the obtaining the original text according to the second preset obtaining rule includes any one of:
acquiring the original text according to a preset absolute path; the absolute path comprises path information starting from a root directory of the original text to the original file identification;
acquiring the original text according to a preset relative path; the relative path comprises the original file identification;
and acquiring the original text from a file storage server through a preset acquisition mode, wherein the preset acquisition mode comprises any one of calling a preset interface and calling a preset file scheduling component.
5. The method according to any one of claims 1-4, wherein the preprocessing the web page text and the original text to generate a target web page text and a target original text comprises:
denoising the webpage text and the original text respectively according to a preset regular expression rule base to obtain an intermediate webpage text and an intermediate original text;
respectively carrying out segmentation processing on the intermediate webpage text and the intermediate original text according to preset segmentation type characters to generate the target webpage text and the target original text; the target webpage text comprises at least one first text sub-content corresponding to the intermediate webpage text; the target original text includes at least one second text sub-content corresponding to the intermediate original text.
6. The method of claim 5, further comprising:
cutting the intermediate original text according to the intermediate webpage text to generate an intermediate original text matched with the intermediate webpage text;
and generating the target original text based on the intermediate original text matched with the intermediate webpage text.
7. The method of claim 6, wherein the performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result comprises:
acquiring the second text sub-content from the target original text;
for each second text sub-content, performing text checking on the second text sub-content and the first text sub-content in the target webpage text to generate a text checking result; the text checking result comprises a checking result and the second text sub-content corresponding to the checking result.
8. The method of claim 7, further comprising:
if the check result is that an error exists, acquiring the second text sub-content corresponding to the check result;
and marking and displaying the second text sub-content.
9. A text checking apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring a webpage text from a webpage to be tested according to a first preset acquisition rule and acquiring an original text according to a second preset acquisition rule; the webpage text is generated based on the original text and is used for displaying in the webpage to be tested;
the preprocessing module is used for preprocessing the webpage text and the original text to generate a target webpage text and a target original text;
and the checking module is used for performing text checking on the target webpage text and the target original text according to a preset checking strategy to generate a text checking result.
10. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that a computer program is stored thereon, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202210249958.9A 2022-03-15 2022-03-15 Text checking method, device, equipment and storage medium Pending CN114417812A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210249958.9A CN114417812A (en) 2022-03-15 2022-03-15 Text checking method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114417812A true CN114417812A (en) 2022-04-29

Family

ID=81264538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210249958.9A Pending CN114417812A (en) 2022-03-15 2022-03-15 Text checking method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114417812A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109710834A (en) * 2018-11-16 2019-05-03 北京字节跳动网络技术有限公司 Similar web page detection method, device, storage medium and electronic equipment
CN111737965A (en) * 2020-05-29 2020-10-02 北京百度网讯科技有限公司 Document comparison method and device, electronic equipment and readable storage medium
CN112989158A (en) * 2019-12-16 2021-06-18 顺丰科技有限公司 Method, device and storage medium for identifying webpage crawler behavior

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
施聪莺 等: "《教育大数据理论与实践》", 31 December 2019, 南京师范大学出版社, pages: 138 - 141 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796145A (en) * 2022-11-16 2023-03-14 珠海横琴指数动力科技有限公司 Method, system, server and readable storage medium for acquiring webpage text
CN115796145B (en) * 2022-11-16 2023-09-08 珠海横琴指数动力科技有限公司 Webpage text acquisition method, system, server and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20220429)