CN115687579B

CN115687579B - Document tag generation and matching method, device and computer equipment

Info

Publication number: CN115687579B
Application number: CN202211158183.0A
Authority: CN
Inventors: 丘文波
Original assignee: Guangzhou Shirong Information Technology Co ltd; Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shirong Information Technology Co ltd; Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2022-09-22
Filing date: 2022-09-22
Publication date: 2023-08-01
Anticipated expiration: 2042-09-22
Also published as: CN115687579A

Abstract

The application belongs to the technical field of the internet, and particularly relates to a method, a device and computer equipment for generating and matching document tags. The document tag generation method comprises the following steps: collecting search text input by a user and the clicked document name text corresponding to the search text; integrating records that have the same search text but different clicked document name texts to obtain a first integration result; obtaining the longest common character string of the search text and each document name text according to the first integration result; obtaining, from the longest common character strings and their clicked times, the longest common character string with the highest click frequency; setting the longest common character string with the highest click frequency as a tag candidate word; and setting at least one of the tag candidate words as a document tag. The method simplifies the creation of document tags and improves the match between the user's search intention and the document tags.

Description

Document tag generation and matching method, device and computer equipment

Technical Field

The application relates to the technical field of internet, in particular to a method, a device and computer equipment for generating and matching document labels.

Background

In vertical-domain content search, such as academic search or community forum search, documents need to be tagged so that the documents a user requires can be matched quickly from the user's search text; how well the search text matches the document tags also affects the final search result. Currently, document tags are usually edited and designed manually, so creating them is labor-intensive, and in some cases the tags match the user's search intent poorly.

Disclosure of Invention

The main purpose of the application is to provide a method, a device and a computer device for generating and matching a document tag, which aim to solve the technical problems that the document tag is complex in creation process and the matching degree of the document tag and a user searching intention is low.

In order to achieve the above object, the present application provides a document tag generating method, including:

collecting search text input by a user and clicked document name text corresponding to the search text;

integrating records which are identical in the search text and different in the corresponding clicked document name text to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text;

obtaining the longest common character string of the search text and each document name text according to the first integration result;

obtaining, according to the longest common character strings and the clicked times, the longest common character string with the highest click frequency among them;

setting the longest common character string with the highest click frequency as a tag candidate word, wherein there is at least one tag candidate word;

and setting at least one of the tag candidate words as a document tag.

The application also provides a document tag matching method, which comprises the following steps:

acquiring search text input by a user;

generating a first label for the search text based on a document label library, wherein the document label library is constructed and obtained based on the document label generating method provided by the embodiment, and the first label comprises at least one label word;

generating a second label for each document based on the document label library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library for a user to search, and the second label comprises at least one label word;

matching the first label with the second label, and setting the same part of the first label and the second label as an effective label, wherein the effective label comprises at least one label word;

based on the first label and the second label, sequentially obtaining a label coverage score of each document, wherein the label coverage score is used for representing the matching degree of the document content and the search text;

based on the effective tags, tag compactness scores of each document are sequentially obtained, wherein the tag compactness scores are used for representing the position closeness degree of the effective tag content in the document content;

obtaining an overall tag matching score for each of the documents according to the tag coverage score and the tag compactness score;

sorting the documents by overall tag matching score to obtain a first sorting result;

and setting the documents meeting the preset rules as documents matched with the search text according to the preset rules and the first sorting result.

In one embodiment, the step of sequentially obtaining a tag compactness score for each document based on the valid tags includes:

generating a position element according to the positions of all tag words in the effective tags in each document, wherein the position element comprises tag words and position information of the tag words;

arranging the position elements in sequence to generate a first sequence;

acquiring a first label combination based on the first sequence, wherein the first label combination comprises all label words in the effective labels, and the position distance among all label words in the document is nearest;

and obtaining the label compactness score of the document according to the first label combination.

In one embodiment, the step of obtaining a first tag combination based on the first sequence comprises:

setting each position element in the first sequence as a target element in sequence, acquiring the position element which is closest to the target element and contains other tag words after the position of the target element, and generating a plurality of position element sequences;

respectively calculating the total distance of each tag word in each position element sequence;

and setting the position element sequence with the minimum total distance as a first label combination.

In one embodiment, the overall tag match score is obtained according to the following formula:

score = score_cover * (1 + t * score_close),

wherein score is an overall tag match score, score_cover is a tag coverage score, score_close is a tag compactness score, and t is a weight, the weight being set based on the tag coverage score.
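The formula above can be written as a one-line function. This is a minimal sketch: the function name `overall_score` is hypothetical, and because the text only says that the weight t is set based on the tag coverage score, t is simply passed in as a parameter here.

```python
def overall_score(score_cover: float, score_close: float, t: float) -> float:
    """Overall tag match score: score_cover * (1 + t * score_close).

    score_cover: tag coverage score
    score_close: tag compactness score
    t:           weight (how it is derived from score_cover is not modelled here)
    """
    return score_cover * (1 + t * score_close)


# Illustrative input values only; they are not taken from the source text.
print(overall_score(score_cover=0.9, score_close=0.5, t=0.2))
print(overall_score(score_cover=0.9, score_close=0.0, t=0.2))
```

With a compactness score of 0 the overall score equals the coverage score, so compactness acts as a multiplicative bonus on top of coverage.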

In one embodiment, the tag coverage score is obtained according to the following formula:

wherein n is the number of tag words in the first tag, and num_query_tag is the number of tag words in the second tag;

when the i-th tag word in the first tag is completely the same as any tag word in the second tag, tag_i = 1;

when the i-th tag word in the first tag is partially identical to any tag word in the second tag, tag_i = N, N ∈ (0, 1);

when the i-th tag word in the first tag is different from every tag word in the second tag, tag_i = 0.

In one embodiment, the tag compactness score is obtained according to the following formula:

wherein L is the total distance between the tag words in the first tag combination, M is a first preset distance threshold, and K is a second preset distance threshold.

The application also provides a document tag generating device, which comprises:

the collecting module is used for collecting search texts input by a user and clicked document name texts corresponding to the search texts;

the integration module is used for integrating the records which are the same in the search text but different in the corresponding clicked document name text to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicked times of the document name texts;

the first acquisition module is used for acquiring the longest common character string of the search text and each document name text according to the first integration result;

the second acquisition module is used for acquiring, according to the longest common character strings and the clicked times, the longest common character string with the highest click frequency among them;

the tag candidate word setting module is used for setting the longest common character string with the highest click frequency as a tag candidate word, wherein there is at least one tag candidate word;

and the document tag generation module is used for setting at least one of the tag candidate words as a document tag.

The application also provides a document tag matching device, comprising:

the search text acquisition module is used for acquiring search text input by a user;

the first tag generation module is used for generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generation method provided by the embodiment, and the first tag comprises at least one tag word;

the second tag generation module is used for generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library for a user to search, and the second tag comprises at least one tag word;

the effective label generating module is used for matching the first label with the second label and setting the same part of the first label and the second label as an effective label, wherein the effective label comprises at least one label word;

the label coverage score acquisition module is used for sequentially acquiring label coverage scores of the documents based on the first label and the second label, wherein the label coverage scores are used for representing the matching degree of the document content and the search text;

the compactness score acquisition module is used for sequentially acquiring a label compactness score of each document based on the effective label, wherein the label compactness score is used for representing the position closeness of the effective label content in the document content;

the overall label matching score obtaining module is used for obtaining the overall label matching score of each document according to the label coverage score and the label compactness score;

the ordering module is used for ordering the documents by overall tag matching score to obtain a first ordering result;

and the matching document setting module is used for setting the documents meeting the preset rules as documents matched with the search text according to the preset rules and the first sorting result.

The application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the steps of the document tag generation method and/or the document tag matching method provided by any embodiment.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the document tag generation method and/or the document tag matching method provided in any of the above embodiments.

The document tag generation and matching method, device and computer equipment provided by the application collect search text input by a user and the clicked document name text corresponding to the search text; integrate records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, according to the longest common character strings and the clicked times, the longest common character string with the highest click frequency; set the longest common character string with the highest click frequency as a tag candidate word, wherein there is at least one tag candidate word; and set at least one of the tag candidate words as a document tag. By generating document tags automatically and setting the longest common character string with the highest click frequency as the tag candidate word, the creation of document tags is simplified and the match between the user's search intention and the document tags is improved.

Drawings

FIG. 1 is a flow chart of a document tag generation method according to an embodiment of the present application;

FIG. 2 is a flow chart of a document tag matching method according to an embodiment of the present application;

FIG. 3 is a flowchart of a method for generating a document tag library according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a prefix tree according to an embodiment of the present application;

FIG. 5 is a flowchart of step S206 in a document tag matching method according to another embodiment of the present application;

FIG. 6 is a flowchart of step S2063 in the document tag matching method according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a document tag generating apparatus according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a document tag matching apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

Referring to fig. 1, an embodiment of the present application provides a document tag generating method, including steps S101-S106, and the details of each step of the method are as follows.

In one embodiment, a document tag generation method includes:

S101, collecting search text input by a user and clicked document name text corresponding to the search text;

S102, integrating records that have the same search text but different clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the clicked times of each document name text;

S103, obtaining the longest common character string of the search text and each document name text according to the first integration result;

S104, obtaining, according to the longest common character strings and the clicked times, the longest common character string with the highest click frequency among them;

S105, setting the longest common character string with the highest click frequency as a tag candidate word, wherein there is at least one tag candidate word;

S106, setting at least one of the tag candidate words as a document tag.

As described in the above step S101, the search text input by the user in the search engine and the document name text corresponding to the search click may be collected according to the search log and the click (click of the searched document) log of the user. To expand the sample data, a centralized collection of information may be performed on the search and click logs over a period of time (e.g., one month).

As described in step S102, the records that have the same search text but different clicked document name texts are integrated to obtain a first integration result, where the first integration result includes the search text, the different document name texts, and the clicked times of each document name text. For example, assume that in the log records the same search text "win10 blue screen" was input multiple times, that the clicked document name texts in the search results were "how win10 blue screen is done", "processing method of computer blue screen", "newly purchased MAC computer blue screen" and "blue screen reinstallation system", and that the clicked times of these document name texts were "10", "10", "1" and "2" respectively. Integrating the log records gives a first integration result, one form of which is shown in Table 1 below:

TABLE 1 First integration result

Search text entered by a user	Clicked document name text	Number of clicks
win10 blue screen	How win10 blue screen is done	10
win10 blue screen	Processing method of computer blue screen	10
win10 blue screen	Newly purchased MAC computer blue screen	1
win10 blue screen	Blue screen reinstallation system	2
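The integration of step S102 can be sketched as a simple count over click records. This is a minimal sketch under stated assumptions: the `click_log` list is a hypothetical stand-in for the search and click logs, built so that its counts reproduce Table 1, and `integrate` is an illustrative name, not one used by the application.

```python
from collections import Counter

# Hypothetical click log: one (search_text, clicked_document_name) pair per click.
click_log = (
    [("win10 blue screen", "how win10 blue screen is done")] * 10
    + [("win10 blue screen", "processing method of computer blue screen")] * 10
    + [("win10 blue screen", "newly purchased MAC computer blue screen")] * 1
    + [("win10 blue screen", "blue screen reinstallation system")] * 2
)


def integrate(log):
    """Merge identical (search text, document name) records and count the clicks."""
    counts = Counter(log)
    return [(query, name, clicks) for (query, name), clicks in counts.items()]


first_integration_result = integrate(click_log)
for row in first_integration_result:
    print(row)
```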

As described in the above steps S103 to S106, the longest common character string of the user's search text and each document name text is obtained according to the first integration result; the longest common character string with the highest click frequency is obtained according to the longest common character strings and the clicked times of the corresponding document name texts; the longest common character string with the highest click frequency is set as a tag candidate word, wherein there is at least one tag candidate word; and at least one of the tag candidate words is set as a document tag.

Table 2 illustrates this with an example. In line 2 of Table 1 (the first data row), the longest common character string of the user's search text and the document name text is "win10 blue screen", and its clicked frequency is 10. In lines 3, 4 and 5 of Table 1, the longest common character string is "blue screen" in each case, with clicked frequencies of 10, 1 and 2 respectively, so the clicked frequency of the longest common character string "blue screen" is 10 + 1 + 2 = 13. Since 13 is the highest clicked frequency, "blue screen" is set as a tag candidate word.

In other embodiments, there may be longest common character strings with the same clicked frequency but different text content. For example, if the longest common character strings are "win10 blue screen" and "blue screen", both have a clicked frequency of 10, and 10 is the highest clicked frequency in the integration result, then both "win10 blue screen" and "blue screen" are set as tag candidate words. In practical applications, either one of them or both of them may then be set as document tags. The number of tag candidate words is not limited to one or two and may be greater than two; no limitation is imposed here.

TABLE 2 first integration result post-treatment
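Steps S103 to S105 can be sketched on the Table 1 data as follows. This is a minimal sketch, not the application's implementation: it assumes that "longest common character string" means the longest contiguous common substring, it lowercases the texts and strips surrounding spaces before counting, and the names `longest_common_substring` and `tag_candidates` are hypothetical.

```python
from collections import defaultdict


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous run of characters shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def tag_candidates(rows):
    """Sum clicks per longest common string; return the top strings and all totals."""
    clicks = defaultdict(int)
    for query, name, n in rows:
        common = longest_common_substring(query.lower(), name.lower()).strip()
        if common:
            clicks[common] += n
    if not clicks:
        return [], {}
    top = max(clicks.values())
    return sorted(s for s, n in clicks.items() if n == top), dict(clicks)


# Rows of Table 1: (search text, clicked document name text, number of clicks).
rows = [
    ("win10 blue screen", "how win10 blue screen is done", 10),
    ("win10 blue screen", "processing method of computer blue screen", 10),
    ("win10 blue screen", "newly purchased MAC computer blue screen", 1),
    ("win10 blue screen", "blue screen reinstallation system", 2),
]
candidates, totals = tag_candidates(rows)
print(totals)      # click totals per longest common string
print(candidates)  # strings that share the highest total
```

On this data "win10 blue screen" accumulates 10 clicks and "blue screen" accumulates 13, so "blue screen" is the only candidate. Strings that tie for the highest total are all returned, which corresponds to the case of several tag candidate words.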

It can be expected that, in practical applications, many related tag words will be obtained in this way. To remove redundancy and ensure that document tag words are meaningful (for example, a preposition with no specific meaning of its own might otherwise be defined as a document tag word), in some embodiments document tag words whose length does not meet a preset requirement (for example, that the length is not 1 and not greater than 10) may be removed. In addition, if a long document tag word can be composed of shorter document tag words, the long document tag word may be removed, and duplicates may be removed, so that the short tag words are retained.
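The pruning described above can be sketched as a length filter followed by a composition check. This is a minimal sketch under assumptions: the length bounds follow the example requirement (not 1, not greater than 10), "composed of short document tag words" is read as exact concatenation of shorter kept tags, and the names `can_be_composed` and `prune_tags` as well as the sample tags are hypothetical.

```python
def can_be_composed(word, parts):
    """True if word is a concatenation of one or more strings from parts."""
    reachable = [True] + [False] * len(word)  # reachable[k]: word[:k] can be built
    for end in range(1, len(word) + 1):
        for part in parts:
            start = end - len(part)
            if start >= 0 and reachable[start] and word[start:end] == part:
                reachable[end] = True
                break
    return reachable[len(word)]


def prune_tags(tags, min_len=2, max_len=10):
    """Drop tags of unsuitable length, duplicates, and long tags built from short ones."""
    kept = {t for t in tags if min_len <= len(t) <= max_len}  # the set removes duplicates
    result = []
    for tag in kept:
        shorter = [t for t in kept if len(t) < len(tag)]
        if not can_be_composed(tag, shorter):
            result.append(tag)
    return sorted(result)


# Illustrative tags only; they are not taken from the source text.
sample = ["blue", "screen", "bluescreen", "a", "win10", "win10", "averyveryverylongtagword"]
print(prune_tags(sample))
```

Here "a" and the 24-character tag fail the length requirement, the repeated "win10" is kept once, and "bluescreen" is dropped because it is the concatenation of "blue" and "screen".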

In some embodiments, when the user's search text and the clicked document name text contain letters or strings of letters, all letters in all texts are converted to a uniform format, for example all lower case or all upper case, in order to avoid text matching errors (letters that differ only in case look different but generally express the same meaning in the text).

Referring to fig. 2, the embodiment of the present application further provides a document tag matching method, which includes steps S201 to S209, and the details of each step of the method are as follows.

In one embodiment, the document tag matching method includes:

S201, acquiring a search text input by a user;

S202, generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tag generating method provided by the embodiment, and the first tag comprises at least one tag word;

S203, generating a second tag for each document based on the document tag library, wherein the documents are stored in the document library, a plurality of documents are stored in the document library for searching by a user, and the second tag comprises at least one tag word;

S204, matching the first tag with the second tag, and setting the same part of the first tag and the second tag as an effective tag, wherein the effective tag comprises at least one tag word;

S205, based on the first tag and the second tag, sequentially obtaining a tag coverage score of each document, wherein the tag coverage score is used for representing the matching degree of the document content and the search text;

S206, based on the effective tags, sequentially obtaining a tag compactness score of each document, wherein the tag compactness score is used for representing the position closeness of the effective tag content in the document content;

S207, obtaining an overall tag matching score of each document according to the tag coverage score and the tag compactness score;

S208, sorting the documents by overall tag matching score to obtain a first sorting result;

S209, setting the documents meeting the preset rule as documents matched with the search text according to the preset rule and the first sorting result.

As described in the above steps S201 to S204, when it is detected that the user inputs a search text in the search engine, the search text is acquired; a first tag is generated for the search text based on the pre-generated document tag library; a second tag is generated for each document in a document library (such as Baidu Wenku or the CNKI paper library) based on the pre-generated document tag library, wherein the document library stores a plurality of documents for the user to search; and the first tag is matched with the second tag, and the part common to the first tag and the second tag is set as the effective tag, wherein the first tag, the second tag and the effective tag each comprise at least one tag word.

Illustratively, when the search text entered by the user is "how do windows blue screen?", a first tag ("windows", "win", "blue screen") is generated for the search text based on the pre-generated document tag library. If the content of one document in the document library is "win10 computer blue screen reload system …", a second tag ("win", "win10", "computer", "blue screen", "reload system") is generated for that document based on the pre-generated document tag library, and a second tag is generated for every other document in the document library in the same way. The first tag ("windows", "win", "blue screen") is then matched with the second tag ("win", "win10", "computer", "blue screen", "reload system"), and the part common to both is set as the effective tag, where the first tag, the second tag and the effective tag each include at least one tag word; in this embodiment, the effective tag is ("win", "blue screen").
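The example can be reproduced with a small sketch of steps S202 to S204. The assumptions are labeled: tags are generated by a naive case-insensitive substring scan over a plain set standing in for the document tag library, `generate_tags` is a hypothetical name, and the library contents are just the tag words that appear in the example.

```python
def generate_tags(text, tag_library):
    """Return every library tag that occurs in text (case-insensitive substring scan)."""
    lowered = text.lower()
    return {tag for tag in tag_library if tag in lowered}


# Hypothetical tag library containing only the tag words used in the example.
tag_library = {"windows", "win", "win10", "computer", "blue screen", "reload system"}

first_tag = generate_tags("how do windows blue screen?", tag_library)
second_tag = generate_tags("win10 computer blue screen reload system", tag_library)

# The effective tag is the part shared by the first and second tags.
effective_tag = first_tag & second_tag
print(sorted(first_tag))
print(sorted(second_tag))
print(sorted(effective_tag))
```

A set intersection is enough here because the effective tag is defined as the part that is the same in both tags.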

In some embodiments, referring to fig. 3, the document tag library generating method includes:

S301, generating a plurality of document tags based on a plurality of user search texts and the document tag generation method provided by the embodiment;

S302, generating a document tag library based on the plurality of document tags.

As described in the above steps S301-S302, in order to generate tags for documents automatically and to simplify the tag matching process, a document tag library may be generated in advance from a large amount of sample data and called in actual applications, thereby improving efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search texts (i.e., the different search texts input by users, together with the clicked document name texts and clicked times corresponding to the search results) and the document tag generation method provided in the above embodiment; and a document tag library is generated based on the plurality of document tags.

In order to improve the efficiency of string statistics and lookup, a prefix tree may be introduced when the library is generated: a prefix tree is built from the plurality of document tags, and this prefix tree (which contains all the document tags) is set as the document tag library. A prefix tree, also called a dictionary tree, word search tree or Trie, is a multi-way tree structure for fast lookup and a variant of the hash tree. It is typically used to count and sort large numbers of strings (but is not limited to strings), so search engine systems often use it for text word-frequency statistics. Its advantage is that unnecessary string comparisons are reduced to a minimum, so query efficiency is high. For example, referring to FIG. 4, given the set of document tags inn, int, at, age, adv, ant, ate, the prefix tree shown in FIG. 4 can be constructed.
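A prefix tree over the example tag set can be sketched as follows. This is a minimal illustration of the data structure, not the application's implementation; the class names `TrieNode` and `TagTrie` and their methods are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # maps one character to the child node
        self.is_tag = False  # True if the path from the root spells a complete tag


class TagTrie:
    def __init__(self, tags=()):
        self.root = TrieNode()
        for tag in tags:
            self.insert(tag)

    def insert(self, tag):
        node = self.root
        for ch in tag:
            node = node.children.setdefault(ch, TrieNode())
        node.is_tag = True

    def _find(self, prefix):
        """Walk down the tree along prefix; return the node reached, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._find(word)
        return node is not None and node.is_tag

    def with_prefix(self, prefix):
        """All stored tags that start with prefix, in sorted order."""
        start = self._find(prefix)
        found = []
        stack = [(start, prefix)] if start else []
        while stack:
            node, text = stack.pop()
            if node.is_tag:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)


trie = TagTrie(["inn", "int", "at", "age", "adv", "ant", "ate"])
print(trie.contains("int"))   # a stored tag
print(trie.contains("in"))    # only a prefix, not a stored tag
print(trie.with_prefix("a"))
```

Tags that share a prefix, such as "inn" and "int", share the nodes for "i" and "n", which is where the saving in string comparisons comes from.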

As described in step S205, a tag coverage score of each document in the document library is sequentially obtained based on the obtained first tag and second tag, where the tag coverage score is used to characterize the matching degree of the content of the document and the search text, and the higher the tag coverage score, the higher the matching degree of the content of the document and the search text is. In some embodiments, the coverage score described above may be obtained from the following equation:

wherein n is the number of tag words in the first tag, and num_query_tag is the number of tag words in the second tag;

when the i-th tag word in the first tag is completely the same as any tag word in the second tag, tag_i = 1;

when the i-th tag word in the first tag is partially identical to any tag word in the second tag, tag_i = N, N ∈ (0, 1);

when the i-th tag word in the first tag is different from every tag word in the second tag, tag_i = 0.

In this embodiment, N may be 0.7. Taking the example that the first label is ("windows", "win", "blue screen") and the second label is ("win", "win10", "computer", "blue screen", "reload system") to calculate the above formula for obtaining the label coverage score, then:
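The per-word values tag_i for this example can be sketched as below. The sketch makes two labeled assumptions: "partially identical" is taken to mean that one tag word contains the other as a substring, which the text does not define, and `tag_value` is a hypothetical name. It computes only the tag_i values and does not combine them into the coverage score.

```python
def tag_value(query_word, doc_tags, partial=0.7):
    """tag_i for one tag word of the first tag against all tag words of the second tag."""
    if query_word in doc_tags:
        return 1.0  # completely the same as some tag word
    # Assumption: "partially identical" means substring containment in either direction.
    if any(query_word in d or d in query_word for d in doc_tags):
        return partial
    return 0.0      # different from every tag word


first = ["windows", "win", "blue screen"]
second = ["win", "win10", "computer", "blue screen", "reload system"]
values = [tag_value(word, second) for word in first]
print(values)  # [0.7, 1.0, 1.0]
```

Under this assumption "windows" is a partial match (it contains "win"), while "win" and "blue screen" are exact matches.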

As described in step S206, based on the effective tags, tag compactness scores of each document in the document library are sequentially obtained, wherein the tag compactness scores are used for representing the close degree of the effective tag content in the document content, and the higher the tag compactness scores, the closer the positions among the tag words in the effective tags are, namely the more the effective tags conform to the actual search intention of the user.

In some embodiments, referring to fig. 5, the step of sequentially obtaining a tag compactness score of each document based on the valid tags includes:

S2061, generating a position element according to the positions of all tag words in the effective tags in each document, wherein the position element comprises tag words and position information of the tag words;

S2062, arranging the position elements in sequence to generate a first sequence;

S2063, acquiring a first tag combination based on the first sequence, wherein the first tag combination comprises all tag words in the effective tags, and the position distance among all tag words in the document is nearest;

S2064, obtaining the tag compactness score of the document according to the first tag combination.

As described in the above steps S2061-S2064, a position element is generated according to the position, in the content of each document, of every tag word in the effective tag, wherein the position element comprises the tag word and the position information of the tag word; the position elements are arranged in sequence to generate a first sequence; a first tag combination is obtained based on the first sequence, wherein the first tag combination comprises all tag words in the effective tag and the positions of these tag words in the document are closest to one another; and the tag compactness score of the document is obtained according to the first tag combination.

For example, suppose there is a document "document X" whose effective tag includes 3 tag words {A, B, C}. A position element can be generated from the position of each tag word; for example, (C, 2) indicates that the tag word "C" is located at the second character of the document content. Assuming the position elements of the 3 tag words have been obtained, the first sequence generated by arranging them in order is:

[(C, 2), (A, 5), (B, 10), (C, 12), (A, 14), (B, 23), (A, 33), (C, 50)].

The tag combinations [(C, 2), (A, 5), (B, 10)], [(A, 5), (B, 10), (C, 12)] and so on each include all tag words in the effective tag; after all such tag combinations are found, the one in which the tag words are closest to one another can be selected as the first tag combination. In some embodiments, referring to fig. 6, the step of obtaining the first tag combination based on the first sequence includes:

S2063a, sequentially setting each position element in the first sequence as a target element, acquiring the position elements which are positioned behind the target element, are closest to the target element and contain the other tag words, and generating a plurality of position element sequences;

S2063b, respectively calculating the total distance of each tag word in each position element sequence;

S2063c, setting the position element sequence with the smallest total distance as the first tag combination.

As described in the above steps S2063a-S2063c, the first sequence is assumed, illustratively, to be [(C, 2), (A, 5), (B, 10), (C, 12), (A, 14), (B, 23), (A, 33), (C, 50)]. Each position element in the first sequence is set as a target element in turn, and the position elements that are located after the target element, are closest to the target element, and contain the other tag words in the valid tag are acquired, so as to generate a plurality of position element sequences. In this embodiment, the generated position element sequences are: [(C, 2), (A, 5), (B, 10)], [(A, 5), (B, 10), (C, 12)], [(B, 10), (C, 12), (A, 14)], [(C, 12), (A, 14), (B, 23)], [(A, 14), (B, 23), (C, 50)] and [(B, 23), (A, 33), (C, 50)].

After all the position element sequences are found, the total distance between the tag words in each position element sequence is calculated. Taking the position element sequence [(C, 2), (A, 5), (B, 10)] as an example, the position of the tag word C is 2, the position of the tag word A is 5, and the position of the tag word B is 10, so the distance between the tag words C and A is 3 characters and the distance between the tag words A and B is 5 characters; the total distance between the tag words in this position element sequence is therefore 8 characters. The total distances of the remaining position element sequences are calculated in the same manner, and the position element sequence with the highest cohesion (i.e., the smallest total distance) is found to be [(B, 10), (C, 12), (A, 14)], whose total distance is 4 characters. This position element sequence is set as the first tag combination, which comprises all tag words in the valid tag and in which the positions of those tag words in the document are closest to one another.
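Steps S2063a-S2063c can be sketched as follows. This is an illustrative reading rather than the applicant's implementation: the total distance is assumed to be the sum of the gaps between consecutive positions in a sequence, and the function names are hypothetical.

```python
def candidate_sequences(first_sequence, tag_words):
    """S2063a: for each target element, take the nearest later element of every other tag word."""
    needed = set(tag_words)
    candidates = []
    for i, (word, pos) in enumerate(first_sequence):
        chosen = {word: pos}
        for later_word, later_pos in first_sequence[i + 1:]:
            if later_word in needed and later_word not in chosen:
                chosen[later_word] = later_pos  # first hit is the closest one
            if len(chosen) == len(needed):
                break
        if set(chosen) == needed:  # keep only sequences containing all tag words
            candidates.append(sorted(chosen.items(), key=lambda e: e[1]))
    return candidates


def total_distance(sequence):
    """S2063b: sum of the gaps between consecutive positions (assumed definition)."""
    positions = [pos for _, pos in sequence]
    return sum(b - a for a, b in zip(positions, positions[1:]))


def first_tag_combination(first_sequence, tag_words):
    """S2063c: the candidate sequence with the smallest total distance."""
    candidates = candidate_sequences(first_sequence, tag_words)
    return min(candidates, key=total_distance) if candidates else None


first_sequence = [("C", 2), ("A", 5), ("B", 10), ("C", 12),
                  ("A", 14), ("B", 23), ("A", 33), ("C", 50)]
best = first_tag_combination(first_sequence, ["A", "B", "C"])
print(best, total_distance(best))
```

For the first sequence of this embodiment, the sketch yields six candidate sequences and selects [(B, 10), (C, 12), (A, 14)], whose total distance is 4.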

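In some embodiments, the tag compactness score described above may be obtained according to the following formula:

score_close = 1, when L < M;
score_close = 1/L, when M ≤ L ≤ K;
score_close = 0, when L > K,

wherein score_close is the tag compactness score, L is the total distance between the tag words in the first tag combination, M is a first preset distance threshold, and K is a second preset distance threshold.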

In this embodiment, M may be 5 and K may be 20. That is, when the total distance L between the tag words in the first tag combination is less than 5, the tag compactness score score_close is 1; when L is greater than 20, score_close is 0; and when L is between 5 and 20, score_close is 1/L. It should be noted that, in other embodiments, the values of M and K may be set according to actual design requirements, which is not limited herein.
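The piecewise rule described in this embodiment can be written as a short function. This is a sketch under the assumption that the boundary values L = 5 and L = 20 fall in the 1/L branch, which the text describes only as "between 5 and 20"; the function and parameter names are hypothetical.

```python
def tag_compactness_score(total_distance: float, m: float = 5, k: float = 20) -> float:
    """Tag compactness score for a total distance L, with thresholds M and K.

    Assumes 0 < m <= k, so the 1/L branch never divides by zero.
    """
    if total_distance < m:
        return 1.0
    if total_distance > k:
        return 0.0
    return 1.0 / total_distance


for distance in (4, 10, 21):
    print(distance, tag_compactness_score(distance))
```

With the default thresholds, a total distance of 4 gives a score of 1, a total distance of 10 gives 0.1, and a total distance of 21 gives 0.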

As described in the above step S207, an overall tag matching score for each document is obtained from the tag coverage score and the tag compactness score obtained by the above calculation. In some embodiments, the overall tag match score may be obtained according to the following formula:

score = score_cover * (1 + t * score_close),

wherein score is the overall tag matching score, score_cover is the tag coverage score, score_close is the tag compactness score, and t is a weight that is set based on the tag coverage score. For example, when score_cover is greater than 0.9, t is 1, and when score_cover is 0.9 or less, t is 0. In other embodiments, the value of t may be set according to actual design requirements, which is not limited herein.
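The combination of the two scores can be sketched as follows, assuming the example weighting in which t is 1 when score_cover is greater than 0.9 and 0 otherwise. The function name and the `threshold` parameter are hypothetical.

```python
def overall_tag_match_score(score_cover: float, score_close: float,
                            threshold: float = 0.9) -> float:
    """score = score_cover * (1 + t * score_close), with t derived from score_cover."""
    # Assumed weighting: compactness only contributes when coverage is high enough.
    t = 1.0 if score_cover > threshold else 0.0
    return score_cover * (1.0 + t * score_close)


print(overall_tag_match_score(1.0, 0.25))  # coverage above the threshold
print(overall_tag_match_score(0.8, 1.0))   # coverage at or below the threshold
```

Under this weighting, the compactness score can only raise the overall score of documents whose tag coverage is already high, so it acts as a tie-breaker among well-covered documents.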

As described in the above steps S208-S209, the overall tag matching scores of all the documents in the document library are calculated by the above method, and the documents are then ranked by overall tag matching score to obtain a first ranking result. For example, the document library includes document A, document B, document C and document D, and the ranking of their overall tag matching scores is: document A < document B < document C < document D. According to a preset rule and the first ranking result, the documents that satisfy the preset rule are set as the documents matched with the search text currently input by the user. For example, when the preset rule is to select the three documents with the highest overall tag matching scores in the document library, in this embodiment document B, document C and document D are selected as the documents matched with the search text currently input by the user and are provided to the user.
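The ranking and selection of steps S208-S209 can be illustrated with a short sketch. The preset rule is assumed to be "take the top three", and the score values are invented purely so that document A < document B < document C < document D holds.

```python
def select_matching_documents(scores: dict[str, float], top_n: int = 3) -> list[str]:
    """Rank documents by overall tag matching score and keep the top_n."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    return ranked[:top_n]


# Hypothetical overall tag matching scores.
scores = {"document A": 0.2, "document B": 0.5, "document C": 0.9, "document D": 1.3}
matched = select_matching_documents(scores)
print(matched)
```

The sketch returns document D, document C and document B, in descending order of score.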

The overall tag matching score is obtained by combining the tag coverage score and the tag compactness score, and is used as the measure for judging how well the document tags match the user's search text. The documents whose overall tag matching scores satisfy the preset rule are selected as the documents matched with the search text currently input by the user, which improves the degree of matching among the document tags, the document content and the user's search text (i.e., the actual search intention).

The document tag generation and matching method comprises: collecting search text input by a user and clicked document name text corresponding to the search text; integrating the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the number of clicks on each document name text; obtaining the longest common character string of the search text and each document name text according to the first integration result; obtaining, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest click frequency; setting the longest common character string with the largest click frequency as a tag candidate word, wherein there is at least one tag candidate word; and setting at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as the tag candidate word, the creation process of the document tag is simplified, and the degree of matching between the user's search intention and the document tag is improved.

Referring to fig. 7, an embodiment of the present application further provides a document tag generating apparatus, including:

a collection module 701, configured to collect a search text input by a user and a clicked document name text corresponding to the search text;

the integrating module 702 is configured to integrate the records that have the same search text but correspond to different clicked document name texts to obtain a first integration result, where the first integration result includes the search text, each document name text, and the clicked times of each document name text;

a first obtaining module 703, configured to obtain, according to the first integration result, a longest common character string of the search text and each document name text;

a second obtaining module 704, configured to obtain, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest click frequency;

a tag candidate word setting module 705, configured to set the longest common character string with the largest click frequency as a tag candidate word, where there is at least one tag candidate word;

a document tag generation module 706, configured to set at least one of the tag candidate words as a document tag.

In this embodiment, the collection module 701 may collect, from a user's search log and click log (the documents clicked in the search results), the search text input by the user in the search engine and the document name text clicked for that search. To expand the sample data, the search and click logs over a period of time (e.g., one month) may be collected together.

The integration module 702 integrates the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, where the first integration result includes the search text, the different document name texts, and the number of clicks on each document name text. For example, assume that in the collected log records the same search text "win10 blue screen" was input multiple times, that the clicked document name texts corresponding to the search results are "how win10 blue screen is processed", "the processing method of the computer blue screen", "the newly purchased MAC computer blue screen" and "blue screen reloading system", and that the numbers of clicks on these document name texts are "10", "10", "1" and "2", respectively; integrating these log records yields the first integration result.

The first obtaining module 703 obtains the longest common character string of the user's search text and each document name text according to the first integration result; the second obtaining module 704 obtains, according to the longest common character strings and the number of clicks on each document name text, the longest common character string with the largest click frequency; the tag candidate word setting module 705 sets the longest common character string with the largest click frequency as a tag candidate word, where there is at least one tag candidate word; and the document tag generation module 706 sets at least one of the tag candidate words as a document tag. Continuing the above example: the click frequencies of the longest common character strings "win10 blue screen" and "blue screen" are 10 and 13, respectively, so "blue screen" is set as the tag candidate word. In other embodiments, there may be longest common character strings with the same click frequency but different text contents. For example, if the longest common character strings are "win10 blue screen" and "blue screen", both have a click frequency of "10", and "10" is the largest click frequency in the integration result, then both "win10 blue screen" and "blue screen" are set as tag candidate words. In practical applications, either one of "win10 blue screen" and "blue screen" may be set as the document tag, or both may be set as document tags. The number of tag candidate words is not limited to one or two and may be more than two, which is not limited herein.
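The tag candidate selection can be sketched as follows. The sketch interprets the "longest common character string" as the longest common substring, sums click counts per common string, and trims surrounding whitespace (a step needed only because the English renderings of the titles contain spaces). The click counts 10, 10, 1 and 2 are assumed values chosen to reproduce the totals of 10 and 13 given in this example, and all function names are hypothetical.

```python
from collections import Counter


def longest_common_substring(a: str, b: str) -> str:
    """Longest contiguous string shared by a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def tag_candidates(search_text: str, clicks: dict[str, int]) -> list[str]:
    """Return the common string(s) with the largest summed click count."""
    totals = Counter()
    for title, count in clicks.items():
        common = longest_common_substring(search_text, title).strip()
        if common:
            totals[common] += count
    if not totals:
        return []
    top = max(totals.values())
    return [text for text, count in totals.items() if count == top]


clicks = {
    "how win10 blue screen is processed": 10,
    "the processing method of the computer blue screen": 10,
    "the newly purchased MAC computer blue screen": 1,
    "blue screen reloading system": 2,
}
print(tag_candidates("win10 blue screen", clicks))
```

Because ties are kept, the function returns every common string that shares the largest click count, which corresponds to the case in which both "win10 blue screen" and "blue screen" become tag candidate words.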

In practical applications, a plurality of related tag words may be obtained in the above manner. To remove redundancy and ensure that the document tag words are meaningful (for example, a particle such as "ground" may also be defined as a document tag word although it carries no specific meaning), in some embodiments the document tag words whose length does not meet a preset requirement (for example, the length must not be 1 and must not be greater than 10) may be removed. In addition, if a long document tag word can be composed of short document tag words, the long document tag word may also be removed, so that only the short tag words are retained after de-duplication.
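The two pruning rules can be sketched as follows. The sketch assumes that "composed of short document tag words" means an exact concatenation of other retained tag words, which suits text written without spaces between words; the function names and the sample tag words are hypothetical.

```python
def can_be_composed(word: str, parts: set[str]) -> bool:
    """True if word is a concatenation of one or more strings from parts."""
    reachable = [True] + [False] * len(word)  # reachable[i]: word[:i] can be built
    for end in range(1, len(word) + 1):
        reachable[end] = any(
            reachable[start] and word[start:end] in parts
            for start in range(end)
        )
    return reachable[len(word)]


def prune_tag_words(tag_words, min_len: int = 2, max_len: int = 10) -> list[str]:
    """Drop tag words of unsuitable length, then drop long words built from short ones."""
    kept = {w for w in tag_words if min_len <= len(w) <= max_len}
    return sorted(w for w in kept if not can_be_composed(w, kept - {w}))


words = ["a", "blue", "screen", "bluescreen", "averyveryverylongtag"]
print(prune_tag_words(words))
```

Here "a" and the 20-character word fail the length requirement, and "bluescreen" is removed because it can be composed of "blue" and "screen".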

Referring to fig. 8, an embodiment of the present application further provides a document tag matching device, including:

a search text acquisition module 801, configured to acquire a search text input by a user;

a first tag generating module 802, configured to generate a first tag for the search text based on a document tag library, where the document tag library is constructed and obtained based on the document tag generating method provided in the above embodiment, and the first tag includes at least one tag word;

a second tag generating module 803, configured to generate a second tag for each document based on the document tag library, where the documents are stored in the document library, and a plurality of documents are stored in the document library for searching by a user, and the second tag includes at least one tag word;

an effective tag generating module 804, configured to match the first tag with the second tag, and set the same portion of the first tag as the second tag as an effective tag, where the effective tag includes at least one tag word;

a tag coverage score obtaining module 805, configured to obtain, based on the first tag and the second tag, a tag coverage score of each document in turn, where the tag coverage score is used to characterize a matching degree of the document content and the search text;

a compactness score obtaining module 806, configured to obtain, based on the valid tags, a tag compactness score of each document in turn, where the tag compactness score is used to characterize how close together the tag content is located in the document content;

an overall tag match score acquisition module 807 for obtaining an overall tag match score for each of the documents based on the tag coverage score and the tag compactness score;

a ranking module 808, configured to rank the documents by overall tag matching score to obtain a first ranking result;

and a matching document setting module 809, configured to set, according to a preset rule and the first ranking result, the document that satisfies the preset rule as a document that matches the search text.

In this embodiment, when it is detected that the user inputs a search text in the search engine, the search text input by the user is acquired through the search text acquisition module 801; the first tag generation module 802 generates a first tag for the search text based on the pre-generated document tag library; the second tag generating module 803 generates, based on the pre-generated document tag library, a second tag for each document in the document library (such as Baidu Wenku or the CNKI paper library), where the document library stores a plurality of documents for the user to search; and the effective tag generating module 804 matches the first tag with the second tag and sets the part of the first tag that is the same as the second tag as an effective tag, where the first tag, the second tag and the effective tag each include at least one tag word.
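The matching of the first tag against the second tag can be sketched as an order-preserving intersection. This assumes that two tag words are "the same" only when they are exactly equal; the function name and the sample tags are hypothetical.

```python
def effective_tag(first_tag: list[str], second_tag: list[str]) -> list[str]:
    """Tag words of the first tag that also appear in the second tag, in query order."""
    in_document = set(second_tag)
    seen = set()
    result = []
    for word in first_tag:
        if word in in_document and word not in seen:
            seen.add(word)  # keep each shared tag word once
            result.append(word)
    return result


first_tag = ["win10", "blue screen", "reinstall"]   # from the search text
second_tag = ["blue screen", "driver", "win10"]     # from one document
print(effective_tag(first_tag, second_tag))
```

For these sample tags the effective tag consists of "win10" and "blue screen".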

In this embodiment, to make it easier to generate tags for documents automatically and to simplify the tag matching process, a document tag library may be generated in advance from a large amount of sample data and called during actual application, thereby improving efficiency. Specifically, a plurality of document tags are generated based on a plurality of user search records (i.e., the different search texts input by users, together with the clicked document name texts and numbers of clicks corresponding to the search results) and the document tag generation method provided in the above embodiment; and the document tag library is generated based on the plurality of document tags.

In some embodiments, to improve the efficiency of counting and looking up character strings, a prefix tree may be introduced when generating the document tag library: a prefix tree is constructed from the plurality of document tags, and this prefix tree, which contains all the document tags, is set as the document tag library. The prefix tree, also called a dictionary tree, word search tree or trie, is a variant of the hash tree and a multi-way tree structure for fast lookup. It is typically used to count and sort large numbers of character strings (but is not limited to character strings), so it is often used by search engine systems for text word frequency statistics. Its advantage is that unnecessary character string comparisons are minimized, so query efficiency is high.
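A prefix tree used as a document tag library can be sketched with nested dictionaries. This is a minimal illustration rather than the structure used by the applicant; the class name `TagTrie`, the end-of-tag marker and the sample tags are hypothetical.

```python
END = ""  # end-of-tag marker; it cannot collide with a single-character key


class TagTrie:
    """Prefix tree that stores document tags and finds them inside a text."""

    def __init__(self, tags=()):
        self.root = {}
        for tag in tags:
            self.insert(tag)

    def insert(self, tag: str) -> None:
        node = self.root
        for ch in tag:
            node = node.setdefault(ch, {})
        node[END] = True

    def find_tags(self, text: str) -> list[str]:
        """Return every stored tag that occurs in text, in order of first match."""
        found = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.get(text[end])
                if node is None:
                    break  # no stored tag continues with this character
                if END in node:
                    tag = text[start:end + 1]
                    if tag not in found:
                        found.append(tag)
        return found


library = TagTrie(["win10", "blue screen", "blue"])
print(library.find_tags("win10 blue screen after update"))
```

Each scan from a given start position stops as soon as no stored tag shares the current prefix, which is the reduction in unnecessary comparisons mentioned above. The same lookup could serve both the search text (to produce a first tag) and a document's content (to produce a second tag).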

In this embodiment, the tag coverage score obtaining module 805 is further configured to obtain, in turn, a tag coverage score of each document in the document library based on the obtained first tag and second tag, where the tag coverage score is used to characterize the degree of matching between the document content and the search text: the higher the tag coverage score, the higher the degree of matching. The compactness score obtaining module 806 obtains, in turn, a tag compactness score of each document in the document library based on the effective tags, where the tag compactness score is used to characterize how close together the effective tag content is located in the document content: the higher the tag compactness score, the closer the positions of the tag words in the effective tag, i.e., the better the effective tag reflects the user's real search intention. The overall tag match score acquisition module 807 obtains the overall tag matching score of each document in the document library from the tag coverage score and the tag compactness score calculated above. The ranking module 808 then ranks the documents by overall tag matching score to obtain a first ranking result. For example, the document library includes document A, document B, document C and document D, and the ranking of their overall tag matching scores is: document A < document B < document C < document D. Finally, the matching document setting module 809 sets, according to the preset rule and the first ranking result, the documents that satisfy the preset rule as the documents matched with the search text currently input by the user. For example, when the preset rule is to select the three documents with the highest overall tag matching scores in the document library, in this embodiment document B, document C and document D are selected as the documents matched with the search text currently input by the user and are provided to the user.

It may be understood that each component of the document tag generating device and the document tag matching device provided in the present application may respectively implement the functions of any one of the document tag generating method, the document tag library generating method and the document tag matching method provided in any one of the foregoing embodiments, and the specific structure is not repeated.

Referring to fig. 9, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in fig. 9. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a storage medium and an internal memory. The storage medium stores an operating system, computer programs and a database. The internal memory provides an environment for running the operating system and the computer programs in the storage medium. The database of the computer device is used for storing data related to the document tag generation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements one or more of the document tag generation method, the document tag library generation method and the document tag matching method provided in any of the above embodiments.

The embodiment of the application further provides a computer readable storage medium, which may be nonvolatile or volatile, and a computer program is stored on the computer readable storage medium, and when the computer program is executed by a processor, one or more methods of the document tag generation method, the document tag library generation method and the document tag matching method provided in any one of the embodiments are implemented.

It will be appreciated by those skilled in the art that all or part of the methods of the above embodiments may be implemented by a computer program instructing the related hardware, where the computer program may be stored in a computer readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The document tag generating method, the document tag matching method, the document tag generating device and the document tag matching device collect search text input by a user and clicked document name text corresponding to the search text; integrate the records that have the same search text but different corresponding clicked document name texts to obtain a first integration result, wherein the first integration result comprises the search text, each document name text and the number of clicks on each document name text; obtain the longest common character string of the search text and each document name text according to the first integration result; obtain, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest click frequency; set the longest common character string with the largest click frequency as a tag candidate word, wherein there is at least one tag candidate word; and set at least one of the tag candidate words as a document tag. By automatically generating the document tag and setting the longest common character string with the largest click frequency as the tag candidate word, the creation process of the document tag is simplified, and the degree of matching between the user's search intention and the document tag is improved.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.

The foregoing describes only preferred embodiments of the present application and is not intended to limit the scope of the claims. All equivalent structures or equivalent processes made using the description and drawings of the present application, and any direct or indirect application in other related technical fields, are likewise included in the scope of the claims of the present application.

Claims

1. The method for generating and matching the document label is characterized by comprising the following steps:

obtaining, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest number of clicks;

setting the longest common character string with the largest number of clicks as a tag candidate word, wherein there is at least one tag candidate word;

setting at least one of the tag candidate words as a document tag;

the matching method of the document tag comprises the following steps:

acquiring search text input by a user;

generating a first label for the search text based on a document label library, wherein the document label library is constructed and obtained based on the document label obtained by the document label generating method, and the first label comprises at least one label word;

ranking the documents by overall tag matching score to obtain a first ranking result;

2. The document tag generation and matching method according to claim 1, wherein the step of sequentially obtaining a tag compactness score for each of the documents based on the valid tags comprises:

arranging the position elements in sequence to generate a first sequence;

3. The document tag generation and matching method according to claim 2, wherein the step of obtaining a first tag combination based on the first sequence comprises:

4. The document tag generation and matching method according to claim 1, wherein the overall tag matching score is obtained according to the following formula:

score = score_cover * (1 + t * score_close),

wherein score is the overall tag matching score, score_cover is the tag coverage score, score_close is the tag compactness score, and t is a weight, which is set based on the tag coverage score.

5. The document tag generation and matching method of claim 1, wherein the tag coverage score is obtained according to the following formula:

[formula not legible in the source text]

wherein the quantities in the formula include the number of tag words in the first tag and the number of tag words in the second tag; and the value assigned to a tag word in the first tag depends on whether that tag word is identical to any one of the tag words in the second tag, partially identical to any one of the tag words in the second tag, or different from every tag word in the second tag.

6. The document tag generation and matching method of claim 2, wherein the tag compactness score is obtained according to the following formula:

7. The document label generating and matching device is characterized in that the document label generating device comprises:

the collection module is used for collecting search texts input by a user and clicked document name texts corresponding to the search texts;

the integration module is used for integrating the records which are the same in the search text but different in the corresponding clicked document name text to obtain a first integration result, wherein the first integration result comprises the search text, the document name texts and the clicking times of the document name texts;

the second acquisition module is used for acquiring, according to the longest common character strings and the numbers of clicks, the longest common character string with the largest number of clicks;

the tag candidate word setting module is used for setting the longest common character string with the largest number of clicks as a tag candidate word, wherein there is at least one tag candidate word;

the document tag generation module is used for setting at least one of the tag candidate words as a document tag;

the document tag matching device comprises:

the first tag generation module is used for generating a first tag for the search text based on a document tag library, wherein the document tag library is constructed and obtained based on the document tags obtained by the document tag generation method, and the first tag comprises at least one tag word;

the label coverage score acquisition module is used for sequentially acquiring label coverage scores of the documents based on the first label and the second label, wherein the coverage scores are used for representing the matching degree of the document content and the search text;

the compactness score acquisition module is used for sequentially acquiring a tag compactness score of each document based on the valid tags, wherein the compactness score is used for representing how close together the tag content is located in the document content;

8. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, carries out the steps of the method according to any of claims 1-6.

9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-6.