CN106815196B - Soft text display frequency statistical method and device - Google Patents

Publication number
CN106815196B
Authority
CN
China
Prior art keywords
content
content block
text
target soft
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510850381.7A
Other languages
Chinese (zh)
Other versions
CN106815196A (en)
Inventor
王名洋
吴丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201510850381.7A priority Critical patent/CN106815196B/en
Publication of CN106815196A publication Critical patent/CN106815196A/en
Application granted granted Critical
Publication of CN106815196B publication Critical patent/CN106815196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for counting the number of times a soft text is shown. The method comprises the following steps: acquiring a plurality of webpage contents, wherein the webpage contents are the contents of a plurality of webpages in a search result page; calculating the text editing distance between each of the plurality of webpage contents and a target soft text; judging, according to that text editing distance, whether each webpage content is the same as the target soft text; and counting the number of webpage contents that are the same as the target soft text as the number of times the target soft text is shown. The method and the device solve the technical problem in the related art that manually counting the number of times a soft text is shown is inefficient.

Description

Soft text display frequency statistical method and device
Technical Field
The application relates to the field of data processing, in particular to a method and a device for counting the showing times of soft texts.
Background
A soft text is a text advertisement, for example a promotional or explanatory article published in a newspaper, in a magazine or on an online promotion platform to promote an enterprise's brand image and popularity or to boost its sales, including specific news reports, in-depth articles, paid short text advertisements, case analyses, and the like. To increase brand reputation or brand exposure, some enterprises produce a batch of soft texts based on brand keywords or product keywords and then publish the soft texts on a plurality of external websites.
To analyze the effect of the published soft texts, it is generally necessary to count how many times the published soft texts are shown and how they rank in the search results for specific keywords on the search side. In the prior art, a person manually searches for the keywords, opens each link in the search result page, checks the corresponding webpage content, and counts how many soft texts are shown and how they rank. This manual approach is inefficient, and its statistical results are error-prone.
No effective solution has yet been proposed for the problem in the related art that manually counting the number of times a soft text is shown is inefficient.
Disclosure of Invention
The application mainly aims to provide a method and a device for counting the showing times of a soft text, so as to solve the problem that the efficiency of counting the showing times of the soft text in a manual mode in the related art is low.
In order to achieve the above object, according to one aspect of the present application, a method for counting the number of times a soft text is presented is provided. The method comprises the following steps: acquiring a plurality of webpage contents, wherein the webpage contents are contents of a plurality of webpages in a search result page; respectively calculating the text editing distance of each webpage content and the target soft text in the plurality of webpage contents; judging whether each webpage content is the same as the target soft text or not according to the text editing distance of each webpage content and the target soft text in the plurality of webpage contents; and counting the number of the webpage contents which are the same as the target soft text in the plurality of webpage contents as the showing times of the target soft text.
Further, the determining whether each web page content is the same as the target soft text according to the text edit distance between each web page content and the target soft text in the plurality of web page contents includes: counting the length of the target soft text; calculating the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text; judging whether the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is smaller than a first threshold value or not; when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be smaller than a first threshold value, determining that the first webpage content is the same as the target soft text; and when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be not smaller than a first threshold value, determining that the first webpage content is different from the target soft text.
Further, the plurality of web page contents include a first web page content, the calculating the text editing distance of each web page content and the target soft text in the plurality of web page contents includes calculating the text editing distance of the first web page content and the target soft text, and the calculating the text editing distance of the first web page content and the target soft text includes: respectively blocking the first webpage content and the target soft text to obtain a first content block list and a second content block list, wherein the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and respectively calculating the text editing distance between each content block in the first content block list and each content block in the second content block list.
Further, the step of judging whether each webpage content is the same as the target soft text according to the text editing distance of each webpage content and the target soft text in the plurality of webpage contents comprises the following steps: acquiring content blocks in the second content block list, which are the same as the content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list; respectively counting the length of the content blocks in the second content block list, which are the same as the content blocks in the first content block list, and the length of the target soft text; calculating the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text; judging whether the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is greater than a second threshold value; when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be larger than a second threshold value, the first webpage content is determined to be the same as the target soft text; and when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be not larger than a second threshold value, determining that the first webpage content is different from the target soft text.
Further, the second content block list includes a first content block, and the obtaining of content blocks in the second content block list that are identical to content blocks in the first content block list according to text editing distances between the content blocks in the first content block list and the content blocks in the second content block list includes: counting the length of the first content block; respectively calculating the ratio of the text editing distance between each content block and the first content block in the first content block list to the length of the first content block to obtain a plurality of ratios; judging whether a ratio smaller than a third threshold value exists in the plurality of ratios; when the ratio smaller than the third threshold value does not exist in the plurality of ratios, determining that the content block same as the first content block does not exist in the first content block list; and when the ratio smaller than the third threshold value is judged to exist in the ratios, determining that the content block same as the first content block exists in the first content block list, and acquiring the first content block.
Further, after counting the number of the web page contents in the plurality of web page contents which are the same as the target soft text as the number of times of showing the target soft text, the method further comprises the following steps: respectively obtaining the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents; and displaying the display times of the target soft text and the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents.
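The counting-plus-ranking step above can be sketched as follows. This is only an illustrative sketch: the function name `count_and_rank`, the `is_match` predicate, and the assumption that the ranking of a webpage content is its 1-based position in the search result list are not taken from the application.

```python
from typing import Callable, List, Sequence, Tuple


def count_and_rank(
    page_texts: Sequence[str],
    is_match: Callable[[str], bool],
) -> Tuple[int, List[int]]:
    """Return (number of matching pages, 1-based ranks of those pages).

    page_texts is assumed to be ordered as in the search result page,
    and is_match is any predicate that decides whether a page text is
    "the same as" the target soft text (hypothetical helper).
    """
    ranks = [
        position
        for position, text in enumerate(page_texts, start=1)
        if is_match(text)
    ]
    return len(ranks), ranks


# Example: the target appears at positions 2 and 4 of the result list.
shown, ranks = count_and_rank(
    ["other", "target", "other", "target"],
    lambda text: text == "target",
)
print(shown, ranks)
```

Any of the similarity tests described in this application could be passed in as `is_match`; exact string equality is used here only to keep the example short.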
In order to achieve the above object, according to another aspect of the present application, a device for counting the number of times a soft text is shown is provided. The device includes: a first acquisition unit, configured to acquire a plurality of webpage contents, wherein the webpage contents are contents of a plurality of webpages in a search result page; a calculation unit, configured to calculate the text editing distance between each of the plurality of webpage contents and the target soft text; a judging unit, configured to judge whether each webpage content is the same as the target soft text according to the text editing distance between each of the plurality of webpage contents and the target soft text; and a counting unit, configured to count the number of webpage contents that are the same as the target soft text as the number of times the target soft text is shown.
Further, the plurality of web contents includes a first web content, and the judging unit includes: the first statistic module is used for counting the length of the target soft text; the first calculation module is used for calculating the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text; the first judgment module is used for judging whether the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is smaller than a first threshold value or not; and the first determining module is used for determining that the first webpage content is the same as the target soft text when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be smaller than a first threshold value, and determining that the first webpage content is different from the target soft text when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be not smaller than the first threshold value.
Further, the plurality of web contents includes a first web content, and the calculation unit includes: the blocking module is used for respectively blocking the first webpage content and the target soft text to obtain a first content block list and a second content block list, wherein the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and the second calculation module is used for calculating the text editing distance between each content block in the first content block list and each content block in the second content block list.
Further, the judging unit includes: the acquisition module is used for acquiring content blocks in the second content block list, which are the same as the content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list; the second counting module is used for respectively counting the length of a content block in the second content block list, which is the same as the content block in the first content block list, and the length of the target soft text; the third calculation module is used for calculating the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text; the second judging module is used for judging whether the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is greater than a second threshold value; and the second determining module is used for determining that the first webpage content is the same as the target soft text when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be larger than a second threshold, and determining that the first webpage content is different from the target soft text when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be not larger than the second threshold.
In the present application, a plurality of webpage contents are acquired, wherein the webpage contents are contents of a plurality of webpages in a search result page; the text editing distance between each of the plurality of webpage contents and the target soft text is calculated; whether each webpage content is the same as the target soft text is judged according to that text editing distance; and the number of webpage contents that are the same as the target soft text is counted as the number of times the target soft text is shown. All webpage contents matching the search keywords are acquired automatically, and the number of times the target soft text is shown is counted according to the text editing distances between those webpage contents and the target soft text. Compared with the prior art, in which the number of times a soft text is shown is counted manually, this is faster. It solves the problem in the related art that manual counting is inefficient and achieves the effect of improving the efficiency of counting the number of times a soft text is shown.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flowchart of a method for counting the number of times a soft text is shown according to an embodiment of the present application; and
FIG. 2 is a schematic diagram of a device for counting the number of times a soft text is shown according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the application, a method for counting the number of times a soft text is shown is provided. FIG. 1 is a flowchart of the method according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps S102 to S108:
Step S102, acquiring a plurality of webpage contents, wherein the webpage contents are contents of a plurality of webpages in a search result page.
The search result page in the embodiment of the application is a search result page obtained by searching based on a search keyword, where the search keyword may be a keyword associated with a target soft text, for example, if the target soft text is a soft text released based on a certain brand keyword, the search keyword may be the brand keyword, or a keyword associated with the brand keyword, and if the target soft text is a soft text released based on a certain product keyword, the search keyword may be the product keyword, or a keyword associated with the product keyword, and the like. It should be noted that the search keyword in the embodiment of the present application may be one or more.
Specifically, after receiving an externally input search keyword, the embodiment of the present application may crawl, by a web crawler, web content in each web link in a search result page corresponding to the search keyword (i.e., a plurality of web content matched with the search keyword), where the web content in the embodiment of the present application refers to text content in a web page.
Step S104, calculating the text editing distance between each of the plurality of webpage contents and the target soft text.
In the embodiment of the present application, the text editing distance between two character strings is the minimum number of editing operations required to convert one string into the other, where the allowed editing operations are replacing one character with another, inserting one character, and deleting one character. For example, for the character strings ABC and ABCD, only the character D needs to be added, that is, only one operation is needed, so the text editing distance between ABC and ABCD is 1. Generally, the smaller the text editing distance, the more similar the two character strings. The target soft text in the embodiment of the application is a soft text whose publication effect currently needs to be monitored.
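The definition above matches what is commonly called the Levenshtein distance. A minimal sketch of its standard dynamic-programming computation follows; the function name `edit_distance` is illustrative and does not come from the application.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character substitutions, insertions
    and deletions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string as b
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("ABC", "ABCD"))  # prints 1, as in the example above
```

Only two rows of the table are kept in memory, so memory use grows with the shorter string, while running time is proportional to the product of the two lengths.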
In the embodiment of the application, after a plurality of web page contents matched with the search keyword are acquired, text editing distances of each web page content and the target soft text in the plurality of web page contents are respectively calculated to obtain a plurality of text editing distances, for example, if 10 web page contents (i.e., web page contents 1 to 10) are matched with the search keyword, the text editing distances of each web page content and the target soft text in the web page contents 1 to 10 are respectively calculated to obtain 10 text editing distances.
Preferably, in order to improve the accuracy of the statistical result, before the text editing distance of each of the web page contents and the target soft text in the web page contents is calculated, the invalid characters in the web page contents and the target soft text may be filtered, wherein the invalid characters may be punctuation marks, spaces, and the like, and then the text editing distance is calculated according to the web page contents after the invalid characters are filtered and the target soft text after the invalid characters are filtered.
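The filtering step can be sketched as below. The application only names punctuation marks and spaces as examples of invalid characters; treating every Unicode punctuation character and every whitespace character as invalid is an assumption of this sketch, as is the function name `strip_invalid_chars`.

```python
import unicodedata


def strip_invalid_chars(text: str) -> str:
    """Drop whitespace and punctuation before computing edit distances.

    Assumption: "invalid" means any whitespace character or any character
    whose Unicode general category starts with "P" (punctuation).
    """
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


print(strip_invalid_chars("Hello, world!"))  # prints Helloworld
```

Filtering both the webpage content and the target soft text in the same way keeps differences in spacing or punctuation from inflating the text editing distance.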
Step S106, judging whether each webpage content is the same as the target soft text according to the text editing distance between each of the plurality of webpage contents and the target soft text.
After the text editing distance between each of the plurality of webpage contents and the target soft text is obtained, whether each webpage content is the same as the target soft text can be judged from that distance. For example, each text editing distance is compared with a threshold: if the text editing distance between a webpage content and the target soft text is smaller than the threshold, the webpage content is determined to be the same as the target soft text; otherwise, it is determined to be different.
Step S108, counting the number of the webpage contents which are the same as the target soft text in the plurality of webpage contents as the display times of the target soft text.
Specifically, the number of the web page contents in the search result page corresponding to the search keyword, which are the same as the target soft text, represents the number of times that the target soft text is displayed in the search result page. When the number of the webpage contents matched with the search keywords is large, the statistical efficiency can be greatly improved, the labor cost is saved, and the accuracy of the statistical result can be improved.
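Steps S104 to S108 can be sketched end to end as follows, using the simple comparison of each text editing distance with a threshold. Crawling (step S102) is left out: `page_texts` is assumed to hold webpage text that has already been fetched. The names `count_presentations` and `max_distance`, and the threshold value used in the example, are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def count_presentations(
    page_texts: Iterable[str], target: str, max_distance: int
) -> int:
    """Count pages whose edit distance to the target is below the threshold."""
    return sum(
        1 for text in page_texts
        if edit_distance(text, target) < max_distance
    )


pages = [
    "buy acme widgets today",   # identical to the target
    "buy acme widgets todai",   # one character differs
    "unrelated news story",     # a different article
]
print(count_presentations(pages, "buy acme widgets today", max_distance=3))
```

In this example the first two pages fall under the threshold and the third does not, so the count is 2.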
In this embodiment, a plurality of webpage contents are acquired, wherein the webpage contents are contents of a plurality of webpages in a search result page; the text editing distance between each of the plurality of webpage contents and the target soft text is calculated; whether each webpage content is the same as the target soft text is judged according to that text editing distance; and the number of webpage contents that are the same as the target soft text is counted as the number of times the target soft text is shown. All webpage contents matching the search keywords are acquired automatically, and the number of times the target soft text is shown is counted according to the text editing distances between those webpage contents and the target soft text. Compared with the prior art, in which the number of times a soft text is shown is counted manually, this is faster. It solves the problem in the related art that manual counting is inefficient and achieves the effect of improving the efficiency of counting the number of times a soft text is shown.
Optionally, the plurality of webpage contents includes a first webpage content, and judging whether each webpage content is the same as the target soft text according to the text editing distance between each of the plurality of webpage contents and the target soft text includes: counting the length of the target soft text; calculating the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text; judging whether the ratio is smaller than a first threshold value; when the ratio is judged to be smaller than the first threshold value, determining that the first webpage content is the same as the target soft text; and when the ratio is judged to be not smaller than the first threshold value, determining that the first webpage content is different from the target soft text.
The first web content of the embodiment of the present application may be any one of the web contents, and the embodiment of the present application will be described below by taking the first web content as an example. The length of the target soft text of the embodiment of the application may be the number of characters of the target soft text, wherein the characters may include characters, letters, numbers and the like.
Specifically, the embodiment of the application calculates the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text and compares the ratio with the first threshold. If the ratio is smaller than the first threshold, the first webpage content is determined to be the same as the target soft text; if the ratio is not smaller than the first threshold, the first webpage content is determined to be different from the target soft text. The first threshold can be set according to the length of the target soft text. For example, when the target soft text is long (for example, its length exceeds 2000), the first threshold may be set larger (for example, 0.38); when the target soft text is short (for example, its length is less than 500), the first threshold may be set smaller (for example, 0.3); otherwise, the first threshold may be set to 0.35.
In this embodiment, the text editing distance between the first webpage content and the target soft text is calculated directly, the ratio of that distance to the length of the target soft text is determined, and whether the first webpage content is the same as the target soft text is determined by comparing the ratio with the first threshold, so execution is fast.
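The ratio test and the length-dependent first threshold can be sketched as follows. The threshold values 0.38, 0.3 and 0.35 and the length limits 2000 and 500 are the example values given in this embodiment; the function names and the handling of an empty target are assumptions of the sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via two-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                   # deletion
                current[j - 1] + 1,                # insertion
                previous[j - 1] + (ch_a != ch_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def first_threshold(target_length: int) -> float:
    """Pick the first threshold from the target length (example values)."""
    if target_length > 2000:
        return 0.38
    if target_length < 500:
        return 0.3
    return 0.35


def is_same_as_target(page_text: str, target: str) -> bool:
    """True when distance / len(target) is below the first threshold."""
    if not target:
        return False  # assumption: an empty target never matches
    ratio = edit_distance(page_text, target) / len(target)
    return ratio < first_threshold(len(target))


print(is_same_as_target("ABCE", "ABCD"))  # ratio 1/4 = 0.25, below 0.3
print(is_same_as_target("WXYZ", "ABCD"))  # ratio 4/4 = 1.0, not below 0.3
```

Dividing by the target length makes the test independent of how long the soft text is, which an absolute distance threshold cannot do.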
Preferably, in order to improve the accuracy of the statistical result, the plurality of web contents includes a first web content, the calculating the text edit distance of each of the plurality of web contents and the target soft text includes calculating the text edit distance of the first web content and the target soft text, and the calculating the text edit distance of the first web content and the target soft text includes: respectively blocking the first webpage content and the target soft text to obtain a first content block list and a second content block list, wherein the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and respectively calculating the text editing distance between each content block in the first content block list and each content block in the second content block list.
In the embodiment of the application, the first web page content and the target soft text are firstly blocked, for example, the first web page content and the target soft text are divided into a plurality of content blocks according to punctuation marks (for example, commas, periods, semicolons, and the like), so as to obtain a first content block list and a second content block list. Preferably, the embodiment of the present application may remove invalid characters (e.g., quotation marks, spaces, etc.) within each content block after dividing the first web page content and the target soft text into a plurality of content blocks, and calculate the text edit distance based on the content blocks from which the invalid characters are removed. Specifically, the embodiment of the present application may traverse the second content block list, and calculate a text editing distance between each content block in the second content block list and each content block in the first content block list.
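The blocking step can be sketched as below. The set of delimiters (ASCII and full-width commas, periods and semicolons) and the rule for invalid characters inside a block (whitespace and Unicode punctuation) are assumptions chosen to match the examples in the text; the function name `split_into_blocks` is illustrative.

```python
import re
import unicodedata
from typing import List

# Assumed delimiters: commas, periods and semicolons, ASCII and full-width.
_DELIMITERS = re.compile(r"[,.;，。；]")


def split_into_blocks(text: str) -> List[str]:
    """Split text at punctuation, then strip invalid characters per block."""
    blocks = []
    for raw_block in _DELIMITERS.split(text):
        cleaned = "".join(
            ch
            for ch in raw_block
            if not ch.isspace()
            and not unicodedata.category(ch).startswith("P")
        )
        if cleaned:  # skip blocks that are empty after cleaning
            blocks.append(cleaned)
    return blocks


print(split_into_blocks('Acme is "great", buy now; today.'))
# prints ['Acmeisgreat', 'buynow', 'today']
```

Applying the same function to the first webpage content and to the target soft text yields the first and second content block lists. Note that splitting at every period also splits decimal numbers, which a production implementation might want to handle separately.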
After the text editing distance between each content block in the first content block list and each content block in the second content block list is obtained, whether the first webpage content and the target soft text are the same can be judged based on the text editing distance between each content block in the first content block list and each content block in the second content block list.
Preferably, the determining whether each web page content is the same as the target soft text according to the text edit distance of each web page content and the target soft text in the plurality of web page contents respectively comprises: acquiring content blocks in the second content block list, which are the same as the content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list; respectively counting the length of the content blocks in the second content block list, which are the same as the content blocks in the first content block list, and the length of the target soft text; calculating the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text; judging whether the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is greater than a second threshold value; when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be larger than a second threshold value, the first webpage content is determined to be the same as the target soft text; and when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be not larger than a second threshold value, determining that the first webpage content is different from the target soft text.
Specifically, the second content block list may be traversed, the text edit distances between each content block in the second content block list and each content block in the first content block list are respectively obtained, and whether each content block in the second content block list is the same as a content block in the first content block list is judged according to those text edit distances. The following description takes a first content block in the second content block list as an example, where the first content block may be any one content block in the second content block list.
Preferably, obtaining the content blocks in the second content block list that are the same as content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list, comprises: counting the length of the first content block; respectively calculating the ratio of the text editing distance between the first content block and each content block in the first content block list to the length of the first content block, to obtain a plurality of ratios; judging whether a ratio smaller than a third threshold exists among the plurality of ratios; when it is judged that no ratio smaller than the third threshold exists among the plurality of ratios, determining that no content block the same as the first content block exists in the first content block list; and when it is judged that a ratio smaller than the third threshold exists among the plurality of ratios, determining that a content block the same as the first content block exists in the first content block list, and acquiring the first content block.
The length of the first content block in the embodiment of the present application may be the number of characters in the first content block. Specifically, after the length of the first content block is obtained, the ratio of the text edit distance between the first content block and each content block in the first content block list to the length of the first content block may be calculated respectively to obtain a plurality of ratios. If no ratio smaller than the third threshold exists among the plurality of ratios, the first content block is different from every content block in the first content block list. If a ratio smaller than the third threshold exists among the plurality of ratios, a content block the same as the first content block exists in the first content block list; that is, the first content block is a content block in the second content block list that is the same as a content block in the first content block list, and the first content block is acquired. The third threshold may be set according to actual conditions; for example, the third threshold is set to 0.35. By performing the above operations on each content block in the second content block list, all content blocks in the second content block list that are the same as content blocks in the first content block list can be obtained.
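The per-block decision above reduces to simple arithmetic once the text edit distances are known. The following sketch assumes the distances between one block of the target soft text and every block of the web page content have already been computed; the function name `has_same_block` and its parameter names are hypothetical.

```python
def has_same_block(target_block_length: int,
                   distances: list[int],
                   third_threshold: float = 0.35) -> bool:
    """Return True if any ratio distance / target_block_length is below the threshold.

    target_block_length: number of characters in the block of the target soft text.
    distances: text edit distances from that block to each web page content block.
    """
    if target_block_length <= 0:
        raise ValueError("target block must contain at least one character")
    ratios = [distance / target_block_length for distance in distances]
    return any(ratio < third_threshold for ratio in ratios)


# A 10-character block with distances 7, 3 and 9 gives ratios 0.7, 0.3 and 0.9;
# 0.3 is below 0.35, so a matching block is considered to exist.
matched = has_same_block(10, [7, 3, 9])
```

Note that the comparison is strict: a ratio exactly equal to the third threshold does not count as a match.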
After all the content blocks in the second content block list that are the same as content blocks in the first content block list are obtained, their lengths are counted. For example, if 10 content blocks in the second content block list are the same as content blocks in the first content block list, the length of each of the 10 content blocks may be counted respectively and the 10 lengths summed. In the embodiment of the application, the ratio of this summed length to the length of the target soft text is calculated and compared with the second threshold, where the second threshold may be set according to the actual situation. For example, with the second threshold set to 0.8, the first web page content is considered to be the same as the target soft text if more than 80% of the content is the same, and is otherwise considered to be different from the target soft text.
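The comparison with the second threshold can be illustrated as below. This is a sketch under one stated assumption: the length of the target soft text is taken to be the total number of characters in its content blocks after invalid characters are removed, which the description does not specify. The name `is_same_as_target` is hypothetical.

```python
def is_same_as_target(matched_blocks: list[str],
                      target_blocks: list[str],
                      second_threshold: float = 0.8) -> bool:
    """Compare the summed length of matched blocks with the target length.

    matched_blocks: blocks of the target soft text found in the web page content.
    target_blocks: all blocks of the target soft text (assumed length basis).
    """
    matched_length = sum(len(block) for block in matched_blocks)
    target_length = sum(len(block) for block in target_blocks)
    if target_length == 0:
        return False  # an empty target cannot be matched
    return matched_length / target_length > second_threshold


target = ["abcde", "fghij", "klmno", "pqrst", "uvwxy"]  # 25 characters
four_of_five = is_same_as_target(target[:4], target)    # 20 / 25 = 0.8
all_five = is_same_as_target(target, target)            # 25 / 25 = 1.0
```

Because the text requires the ratio to be greater than the second threshold, a repetition of exactly 80% is judged as different in this sketch.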
In the embodiment of the application, the above operations are performed on each of the plurality of web page contents to judge whether it is the same as the target soft text. After every web page content has been judged, the number of web page contents that are the same as the target soft text can be counted, thereby obtaining the showing times of the target soft text.
Preferably, in order to facilitate the user to visually check the delivery effect of the soft text, after counting the number of the web page contents in the plurality of web page contents that are the same as the target soft text as the number of times of showing the target soft text, the method further includes: respectively obtaining the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents; and displaying the display times of the target soft text and the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents.
In the embodiment of the application, when the web crawler crawls the web page contents, it also crawls the rank of each web page content (namely, its rank in the search result page). After the showing times of the target soft text are counted, the showing times and the ranks of the matching web page contents are presented to the user for viewing.
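As a small illustration of the display step, the sketch below assumes each crawled web page content has already been judged and is available as a pair of its rank and a same/different flag. The function name `summarize` and the output format are hypothetical and not part of the described method.

```python
def summarize(results: list[tuple[int, bool]]) -> str:
    """Build a display string from (rank, is_same_as_target) pairs."""
    ranks = sorted(rank for rank, is_same in results if is_same)
    return f"showing times: {len(ranks)}; ranks: {ranks}"


# Ranks 1 and 3 matched the target soft text; rank 2 did not.
report = summarize([(1, True), (2, False), (3, True)])
```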
According to another embodiment of the present application, a method for counting the number of soft text presentations comprises the following steps:
in step S202, the user inputs a keyword to be queried.
The above-mentioned keywords to be queried are search keywords.
Step S204, the web crawler captures the web page contents of the search result page according to the keywords, and returns all the captured web page contents and sequence numbers.
The above sequence number is the ranking of the web page content in the search result page.
In step S206, any one of the web page contents is divided into several content blocks according to punctuation marks (e.g., periods, commas, semicolons, etc.).
In step S208, invalid characters (e.g., quotation marks, spaces, etc.) within the content block are removed.
Step S210, the content blocks with the invalid characters removed are grouped into a content block list 1.
Step S212, similarly, the target soft text is partitioned and invalid characters in each content block are removed, so as to obtain a content block list 2.
In step S214, the text edit distance of each content block in the content block list 1 from each content block in the content block list 2 is calculated.
For example, the text edit distance of two strings ABC and ABCD is 1.
In step S216, the same content blocks in the content block list 1 and the content block list 2 are obtained according to the text editing distance between each content block in the content block list 1 and each content block in the content block list 2.
Specifically, after the text editing distance between each content block in the content block list 1 and each content block in the content block list 2 is calculated, each text editing distance is divided by the corresponding original character string length to obtain a plurality of ratios. The original character string length may be the length of the content block in the content block list 1 used for calculating the text editing distance, or the length of the content block in the content block list 2 used for calculating the text editing distance. For example, if the text editing distance is calculated from content block 1 in the content block list 1 and content block 2 in the content block list 2, the text editing distance may be divided by the length of content block 1 or by the length of content block 2.
After the ratios are obtained, each ratio may be compared with a threshold 1 (i.e., the third threshold), for example 0.35. If a ratio is smaller than 0.35, the content block in the content block list 1 and the content block in the content block list 2 corresponding to that ratio are the same; otherwise, the two content blocks are different.
In step S218, after the same content blocks in the content block list 1 and the content block list 2 are obtained, the total number of characters of the same content blocks is divided by the total number of characters of the target soft text to obtain the repetition rate.
Step S220, comparing the repetition rate with a threshold 2, and if the repetition rate is greater than the threshold 2, determining that the web page content is the same as the target soft text.
Specifically, the threshold 2 may be set to 0.8 (i.e., 80%), and the web page content is considered to be the same as the target soft text if the repetition rate is greater than 80%.
Step S222, executing step S206 to step S220 on all the web page contents crawled in step S204, so as to obtain the web page contents which are the same as the target soft text in all the crawled web page contents.
Step S224, counting the number of the same webpage contents as the target soft texts in all the crawled webpage contents as the display times of the target soft texts.
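Steps S206 to S224 can be put together in one minimal sketch. Several points are assumptions rather than details stated here: the text edit distance is taken to be the standard Levenshtein distance (consistent with the ABC/ABCD example giving 1), each distance is divided by the length of the target soft text block, the total number of characters is counted after invalid characters are removed, and the crawler of steps S202 to S204 is replaced by a hand-written list of (rank, content) pairs. All function names are hypothetical.

```python
import re

DELIMITERS = re.compile(r"[,.;，。；]")     # assumed punctuation set
INVALID = re.compile(r"[\s\"'“”‘’]")        # assumed invalid characters


def split_blocks(text):
    """Steps S206 to S212: split at punctuation, strip invalid characters."""
    cleaned = (INVALID.sub("", part) for part in DELIMITERS.split(text))
    return [block for block in cleaned if block]


def edit_distance(a, b):
    """Step S214: Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]


def is_same(page_text, target_text, threshold_1=0.35, threshold_2=0.8):
    """Steps S216 to S220 for a single web page content."""
    page_blocks = split_blocks(page_text)        # content block list 1
    target_blocks = split_blocks(target_text)    # content block list 2
    total = sum(len(block) for block in target_blocks)
    if total == 0:
        return False
    matched = sum(
        len(target_block) for target_block in target_blocks
        if any(edit_distance(page_block, target_block) / len(target_block) < threshold_1
               for page_block in page_blocks)
    )
    return matched / total > threshold_2         # repetition rate vs. threshold 2


def count_presentations(pages, target_text):
    """Steps S222 to S224: pages is a list of (rank, content) pairs."""
    ranks = [rank for rank, content in pages if is_same(content, target_text)]
    return len(ranks), ranks


target = "Brand X launches new phone, battery lasts two days, price starts at 999."
pages = [
    (1, "Brand X launches new phone, battery lasts two days, price starts at 999."),
    (2, "Weather today is sunny, no rain expected."),
    (3, "Brand X launches new phone; battery lasts 2 days; price starts at 999."),
]
count, ranks = count_presentations(pages, target)
```

In this example the page at rank 3 differs from the target only in punctuation and in one block ("2" instead of "two"), so it is still judged to be the same as the target soft text, while the unrelated page at rank 2 is not.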
It should be noted that, in the embodiment of the present application, different blocking methods may be used to block the web page content and the target soft text, or the text editing distance between the web page content and the target soft text may be directly calculated without performing blocking to determine whether the web page content and the target soft text are the same, and the determination method is the same, and is not described herein again. In addition, the embodiment of the application can also crawl the web page content of the whole network to carry out statistics on the showing times of the target soft text.
In addition, dividing the web page content and the target soft text into blocks, processing the blocks, and then comparing them can improve the comparison accuracy and thereby the accuracy of the statistical result.
As can be seen from the above description, the embodiment of the present application can implement more accurate text matching determination, and can automatically perform text matching.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from that presented herein.
According to another aspect of the embodiments of the present application, a device for counting the showing times of soft text is provided. The device may be used to perform the method for counting the showing times of soft text according to the embodiments of the present application, and that method may likewise be performed by the device.
Fig. 2 is a schematic diagram of a device for counting the showing times of soft text according to an embodiment of the present application. As shown in Fig. 2, the device includes: a first acquisition unit 10, a calculation unit 20, a judgment unit 30 and a statistic unit 40.
The first acquiring unit 10 is configured to acquire a plurality of web page contents, where the plurality of web page contents are contents of a plurality of web pages in a search result page.
The calculating unit 20 is configured to calculate the text editing distance between each of the plurality of web page contents and the target soft text, respectively.
The judging unit 30 is configured to judge whether each web page content is the same as the target soft text according to the text editing distance between each of the plurality of web page contents and the target soft text.
The counting unit 40 is configured to count the number of web page contents, which are the same as the target soft text, in the plurality of web page contents as the number of times of displaying the target soft text.
In the embodiment of the application, the first obtaining unit 10 obtains a plurality of web page contents, where the plurality of web page contents are contents of a plurality of web pages in a search result page; the calculation unit 20 calculates the text editing distance between each of the plurality of web page contents and the target soft text; the judging unit 30 judges whether each web page content is the same as the target soft text according to that text editing distance; and the counting unit 40 counts the number of web page contents that are the same as the target soft text as the showing times of the target soft text. In this way, all web page contents matching the search keyword are acquired automatically, and the showing times of the target soft text are counted according to the text editing distances between those web page contents and the target soft text. This is faster than counting the showing times of soft text manually as in the prior art, which solves the technical problem in the related art that manual counting of the showing times of soft text is inefficient and achieves the effect of improving the efficiency of counting the showing times of soft text.
Preferably, the plurality of web contents includes a first web content, and the judging unit 30 includes: the first statistic module is used for counting the length of the target soft text; the first calculation module is used for calculating the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text; the first judgment module is used for judging whether the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is smaller than a first threshold value or not; and the first determining module is used for determining that the first webpage content is the same as the target soft text when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be smaller than a first threshold value, and determining that the first webpage content is different from the target soft text when the ratio of the text editing distance between the first webpage content and the target soft text to the length of the target soft text is judged to be not smaller than the first threshold value.
Preferably, the plurality of web contents includes a first web content, and the calculation unit 20 includes: the blocking module is used for respectively blocking the first webpage content and the target soft text to obtain a first content block list and a second content block list, wherein the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and the second calculation module is used for calculating the text editing distance between each content block in the first content block list and each content block in the second content block list.
Preferably, the judging unit 30 includes: the acquisition module is used for acquiring content blocks in the second content block list, which are the same as the content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list; the second counting module is used for respectively counting the length of a content block in the second content block list, which is the same as the content block in the first content block list, and the length of the target soft text; the third calculation module is used for calculating the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text; the second judging module is used for judging whether the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is greater than a second threshold value; and the second determining module is used for determining that the first webpage content is the same as the target soft text when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be larger than a second threshold, and determining that the first webpage content is different from the target soft text when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be not larger than the second threshold.
The device for counting the number of times of showing the soft text comprises a processor and a memory, wherein the first acquiring unit, the calculating unit, the judging unit, the counting unit and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the showing times of the soft text are counted by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code initializing the following method steps: acquiring a plurality of webpage contents, wherein the webpage contents are contents of a plurality of webpages in a search result page; respectively calculating the text editing distance of each webpage content and the target soft text in the plurality of webpage contents; judging whether each webpage content is the same as the target soft text or not according to the text editing distance of each webpage content and the target soft text in the plurality of webpage contents; and counting the number of the webpage contents which are the same as the target soft text in the plurality of webpage contents as the showing times of the target soft text.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (4)

1. A statistical method for showing times of soft texts is characterized by comprising the following steps:
acquiring a plurality of webpage contents, wherein the webpage contents are contents of a plurality of webpages in a search result page;
respectively calculating the text editing distance of each webpage content and the target soft text in the plurality of webpage contents;
judging whether each webpage content is the same as the target soft text or not according to the text editing distance of each webpage content and the target soft text in the plurality of webpage contents; and
counting the number of the webpage contents which are the same as the target soft text in the plurality of webpage contents as the showing times of the target soft text;
wherein the plurality of web page contents include a first web page content, the calculating the text editing distance between each web page content in the plurality of web page contents and the target soft text includes calculating the text editing distance between the first web page content and the target soft text, and the calculating the text editing distance between the first web page content and the target soft text includes:
respectively blocking the first webpage content and the target soft text to obtain a first content block list and a second content block list, wherein the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and
respectively calculating text editing distances between each content block in the first content block list and each content block in the second content block list;
wherein respectively judging whether each webpage content is the same as the target soft text according to the text editing distance between each webpage content in the plurality of webpage contents and the target soft text comprises:
acquiring content blocks in the second content block list, which are the same as the content blocks in the first content block list, according to the text editing distance between each content block in the first content block list and each content block in the second content block list;
respectively counting the length of the content block in the second content block list, which is the same as the content block in the first content block list, and the length of the target soft text;
calculating the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text;
judging whether the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is greater than a second threshold value;
when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be larger than the second threshold, determining that the first webpage content is the same as the target soft text; and
when the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is judged to be not greater than the second threshold, determining that the first webpage content is not the same as the target soft text.
2. The method of claim 1, wherein the second list of content blocks comprises a first content block, and wherein obtaining content blocks in the second list of content blocks that are identical to content blocks in the first list of content blocks according to text edit distances between the content blocks in the first list of content blocks and the content blocks in the second list of content blocks comprises:
counting the length of the first content block;
respectively calculating the ratio of the text editing distance between the first content block and each content block in the first content block list to the length of the first content block to obtain a plurality of ratios;
judging whether a ratio smaller than a third threshold value exists in the plurality of ratios;
when it is determined that there is no ratio smaller than the third threshold in the plurality of ratios, determining that there is no content block in the first content block list that is the same as the first content block; and
when it is determined that there is a ratio smaller than the third threshold in the plurality of ratios, determining that there is a content block in the first content block list that is the same as the first content block, and acquiring the first content block.
3. The method according to claim 1, wherein after counting the number of web page contents in the plurality of web page contents that are the same as the target soft text as the number of presentations of the target soft text, the method further comprises:
respectively obtaining the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents; and
displaying the showing times of the target soft text and the ranking of the webpage content which is the same as the target soft text in the plurality of webpage contents.
4. A statistics device for showing times of soft text, comprising:
a first obtaining unit, configured to obtain a plurality of web page contents, where the plurality of web page contents are contents of a plurality of web pages in a search result page;
the calculation unit is used for calculating the text editing distance of each webpage content and the target soft text in the plurality of webpage contents respectively;
a judging unit, configured to judge whether each of the web page contents is the same as the target soft text according to a text editing distance between each of the web page contents and the target soft text; and
the counting unit is used for counting the number of the webpage contents which are the same as the target soft text in the plurality of webpage contents as the display times of the target soft text;
wherein the plurality of web contents includes a first web content, the calculation unit includes:
a blocking module, configured to block the first webpage content and the target soft text to obtain a first content block list and a second content block list, where the first content block list is a content block list obtained after the first webpage content is blocked, and the second content block list is a content block list obtained after the target soft text is blocked; and
the second calculation module is used for respectively calculating the text editing distance between each content block in the first content block list and each content block in the second content block list;
wherein the judging unit includes:
an obtaining module, configured to obtain, according to the text editing distance between each content block in the first content block list and each content block in the second content block list, the content blocks in the second content block list that are the same as content blocks in the first content block list;
a second counting module, configured to determine the length of the content blocks in the second content block list that are the same as content blocks in the first content block list, and the length of the target soft text, respectively;
a third calculating module, configured to calculate a ratio between a length of a content block in the second content block list, which is the same as the content block in the first content block list, and a length of the target soft text;
a second judging module, configured to judge whether a ratio between a length of a content block in the second content block list, which is the same as a content block in the first content block list, and a length of the target soft text is greater than a second threshold; and
a second determining module, configured to determine that the first web page content is the same as the target soft text when it is determined that a ratio of a length of a content block in the second content block list, which is the same as a content block in the first content block list, to the length of the target soft text is greater than a second threshold, and determine that the first web page content is different from the target soft text when it is determined that the ratio of the length of the content block in the second content block list, which is the same as the content block in the first content block list, to the length of the target soft text is not greater than the second threshold.
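The block-based judgment recited in claim 4 can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the claim does not say how content is divided into blocks, how two blocks are judged "the same", or what value the second threshold takes. Here blocks are split on sentence punctuation, two blocks count as the same when their normalized edit distance is at most a hypothetical `block_threshold`, the length of the target soft text is taken as the total length of its blocks, and `second_threshold` defaults to an arbitrary 0.8.

```python
import re
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def split_blocks(text: str) -> List[str]:
    # Assumption: a content block is a sentence-like piece of text.
    return [b.strip() for b in re.split(r"[。！？.!?\n]+", text) if b.strip()]


def blocks_match(x: str, y: str, block_threshold: float = 0.2) -> bool:
    # Assumption: blocks are "the same" when the edit distance, divided by
    # the longer block's length, does not exceed block_threshold.
    return edit_distance(x, y) / max(len(x), len(y)) <= block_threshold


def is_same_as_soft_text(page: str, soft_text: str,
                         second_threshold: float = 0.8,
                         block_threshold: float = 0.2) -> bool:
    page_blocks = split_blocks(page)       # first content block list
    soft_blocks = split_blocks(soft_text)  # second content block list
    total_len = sum(len(b) for b in soft_blocks)
    if total_len == 0:
        return False
    # Total length of soft-text blocks that match some block of the page.
    matched_len = sum(
        len(sb) for sb in soft_blocks
        if any(blocks_match(pb, sb, block_threshold) for pb in page_blocks)
    )
    return matched_len / total_len > second_threshold
```

Matching block by block, rather than comparing whole documents, lets a page still be recognized when the soft text is surrounded by unrelated content or contains small edits.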
CN201510850381.7A 2015-11-27 2015-11-27 Soft text display frequency statistical method and device Active CN106815196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510850381.7A CN106815196B (en) 2015-11-27 2015-11-27 Soft text display frequency statistical method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510850381.7A CN106815196B (en) 2015-11-27 2015-11-27 Soft text display frequency statistical method and device

Publications (2)

Publication Number Publication Date
CN106815196A CN106815196A (en) 2017-06-09
CN106815196B CN106815196B (en) 2020-07-31

Family

ID=59155493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510850381.7A Active CN106815196B (en) 2015-11-27 2015-11-27 Soft text display frequency statistical method and device

Country Status (1)

Country Link
CN (1) CN106815196B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598530B (en) * 2017-09-30 2021-01-22 Beijing Gridsum Technology Co Ltd Method and device for monitoring soft text advertisement delivery
CN111194457A (en) * 2018-07-31 2020-05-22 AI Samurai Inc. (株式会社艾飒木兰) Patent evaluation determination method, patent evaluation determination device, and patent evaluation determination program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101350032A (en) * 2008-09-23 2009-01-21 Hu Hui Method for judging whether web page content is identical or not
CN101990670A (en) * 2008-04-11 2011-03-23 Microsoft Corporation Search results ranking using editing distance and document information
CN102546034A (en) * 2012-02-07 2012-07-04 Shenzhen Newglee Technology Co., Ltd. (深圳市纽格力科技有限公司) Method and equipment for processing voice signals
CN103473507A (en) * 2013-09-25 2013-12-25 Xi'an Jiaotong University Android malicious software detection method based on method call graph
CN104133870A (en) * 2014-07-22 2014-11-05 Harbin Institute of Technology (Weihai) Web page similarity calculation method and web page similarity calculation device
CN104598536A (en) * 2014-12-29 2015-05-06 Zhejiang University Structured processing method of distributed network information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101990670A (en) * 2008-04-11 2011-03-23 Microsoft Corporation Search results ranking using editing distance and document information
CN101350032A (en) * 2008-09-23 2009-01-21 Hu Hui Method for judging whether web page content is identical or not
CN102546034A (en) * 2012-02-07 2012-07-04 Shenzhen Newglee Technology Co., Ltd. (深圳市纽格力科技有限公司) Method and equipment for processing voice signals
CN103473507A (en) * 2013-09-25 2013-12-25 Xi'an Jiaotong University Android malicious software detection method based on method call graph
CN104133870A (en) * 2014-07-22 2014-11-05 Harbin Institute of Technology (Weihai) Web page similarity calculation method and web page similarity calculation device
CN104598536A (en) * 2014-12-29 2015-05-06 Zhejiang University Structured processing method of distributed network information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
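Web page deduplication strategy based on edit distance (基于编辑距离的网页去重策略); Ding Zeya et al.; 《网络新媒体技术》 (Journal of Network New Media); 2013-11-15; Vol. 2, No. 6; pp. 1-7 *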

Also Published As

Publication number Publication date
CN106815196A (en) 2017-06-09

Similar Documents

Publication Publication Date Title
AU2017408801B2 (en) User keyword extraction device and method, and computer-readable storage medium
CN106815207B (en) Information processing method and device for legal referee document
US8566303B2 (en) Determining word information entropies
CN109325182B (en) Information pushing method and device based on session, computer equipment and storage medium
CN106445963B (en) Advertisement index keyword automatic generation method and device of APP platform
CN106776609B (en) Statistical method and device for website reprint quantity
WO2008106668A1 (en) User query mining for advertising matching
CN104008186A (en) Method and device for determining keywords in target text
CN105550359B (en) Webpage sorting method and device based on vertical search and server
CN105404631B (en) Picture identification method and device
CN103559313B (en) Searching method and device
CN110717801A (en) Commodity information pushing method and device
CN110889045B (en) Label analysis method, device and computer readable storage medium
CN110825977A (en) Data recommendation method and related equipment
CN109582155B (en) Recommendation method and device for inputting association words, storage medium and electronic equipment
CN107729337B (en) Event monitoring method and device
CN107608980A (en) Information-pushing method and system based on the analysis of DPI big datas
CN104462396A (en) Method and device for handing character strings
CN106649308B (en) Word segmentation and word library updating method and system
EP3301603A1 (en) Improved search for data loss prevention
CN106815196B (en) Soft text display frequency statistical method and device
CN108388556B (en) Method and system for mining homogeneous entity
CN106033444B (en) Text content clustering method and device
CN103389981A (en) Network label automatic identification method and system thereof
CN107665222B (en) Keyword expansion method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant