CN112835923A - Correlation retrieval method, device and equipment - Google Patents

Info

Publication number
CN112835923A
Authority
CN
China
Prior art keywords
target
keyword
content
retrieval
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110141804.3A
Other languages
Chinese (zh)
Inventor
兰亭
徐琳玲
张闯
强锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110141804.3A
Publication of CN112835923A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/248 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Fuzzy Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a correlation retrieval method, device, and equipment, relating to the technical field of big data. The method comprises: determining a target retrieval keyword of retrieval content input by a user; acquiring a target associated data set, where the target associated data set is constructed using an association analysis algorithm and comprises multiple groups of data, each group comprising a keyword and at least one corresponding related keyword; determining, based on the target associated data set, a target related keyword corresponding to the target retrieval keyword; and retrieving with the target retrieval keyword and the target related keyword to obtain a plurality of retrieval results. In these embodiments, related content that does not contain the target retrieval keyword can be retrieved using the target related keyword, which improves the comprehensiveness of the retrieval results, allows relevant results to be queried accurately for the user, and improves the user experience.

Description

Correlation retrieval method, device and equipment
Technical Field
The embodiments of this specification relate to the technical field of big data, and in particular to a correlation retrieval method, device, and equipment.
Background
At present, retrieval in big data mainly relies on fuzzy matching of keywords against the content input by the user, but this approach easily misses related content that does not contain the keywords. The retrieval schemes in the related art therefore cannot comprehensively retrieve information related to the content input by the user.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of this specification provide a correlation retrieval method, device, and equipment to solve the problem in the prior art that information related to content input by a user cannot be comprehensively retrieved.
An embodiment of the present specification provides a correlation search method, including: determining a target retrieval keyword of retrieval content input by a user; acquiring a target associated data set; the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword; determining target related keywords corresponding to the target retrieval keywords based on the target associated data set; and searching by using the target search keyword and the target related keyword to obtain a plurality of search results.
An embodiment of the present specification further provides a related search apparatus, including: the first determining module is used for determining a target retrieval keyword of retrieval content input by a user; the acquisition module is used for acquiring a target associated data set; the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword; a second determining module, configured to determine, based on the target associated data set, a target related keyword corresponding to the target search keyword; and the retrieval module is used for retrieving by utilizing the target retrieval keywords and the target related keywords to obtain a plurality of retrieval results.
The embodiment of the specification further provides a related retrieval device which comprises a processor and a memory for storing processor executable instructions, wherein the processor executes the instructions to realize the steps of the related retrieval method.
The present specification also provides a computer readable storage medium, on which computer instructions are stored, and when executed, the instructions implement the steps of the related retrieval method.
The embodiments of this specification provide a correlation retrieval method that can determine a target retrieval keyword of retrieval content input by a user and acquire a target associated data set, where the target associated data set is constructed using an association analysis algorithm. Because the associated data set comprises multiple groups of data, each comprising a keyword and at least one corresponding related keyword, the target related keyword corresponding to the target retrieval keyword can be determined based on the target associated data set. The target retrieval keyword and the target related keyword can then both be used for retrieval to obtain a plurality of retrieval results. In this way, related content that does not contain the target retrieval keyword can be retrieved using the target related keyword, which broadens the perspective from which the retrieval content is viewed, improves the comprehensiveness of the retrieval results, allows relevant results to be queried accurately for the user, and improves the user experience.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the disclosure, are incorporated in and constitute a part of this specification, and are not intended to limit the embodiments of the disclosure. In the drawings:
FIG. 1 is a schematic diagram of the steps of a correlation retrieval method according to an embodiment of this specification;
FIG. 2 is a schematic structural diagram of a correlation retrieval device according to an embodiment of this specification;
FIG. 3 is a schematic structural diagram of correlation retrieval equipment according to an embodiment of this specification.
Detailed Description
The principles and spirit of the embodiments of the present specification will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are presented merely to enable those skilled in the art to better understand and to implement the embodiments of the present description, and are not intended to limit the scope of the embodiments of the present description in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, implementations of the embodiments of the present description may be embodied as a system, an apparatus, a method, or a computer program product. Therefore, the disclosure of the embodiments of the present specification can be embodied in the following forms: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
Although the flow described below includes operations that occur in a particular order, it should be appreciated that the processes may include more or less operations that are performed sequentially or in parallel (e.g., using parallel processors or a multi-threaded environment).
Referring to fig. 1, the present embodiment can provide a related search method. The related retrieval method can be used for comprehensively and accurately retrieving the information related to the retrieval content input by the user. The above-mentioned correlation retrieval method may include the following steps.
S101: Determine a target retrieval keyword of the retrieval content input by the user.
In this embodiment, since the user inputs the content to be retrieved in the input box of the interface corresponding to the target search engine when performing the retrieval, the target retrieval keyword of the retrieval content input by the user may be determined first in order to determine the intention of the user retrieval and improve the effectiveness of the retrieval. The target search keyword may be one or more, and may be determined specifically according to an actual situation, which is not limited in this specification.
In this embodiment, the retrieval content input by the user may be one or more words, a sentence, or a paragraph, as determined by the actual situation; this is not limited in the embodiments of this specification. Because the retrieval content input by the user may contain redundant information or may not accurately express the user's intention, retrieving directly with the content as input cannot be done accurately. The target retrieval keyword of the retrieval content can therefore be determined first, so as to determine the user's retrieval intention. For example, if the retrieval content input by the user is "what is the weather in Beijing", the target retrieval keywords are "Beijing" and "weather", from which it can be determined that the user wants to query the weather conditions in Beijing, effectively improving retrieval efficiency and accuracy.
S102: Acquire a target associated data set. The target associated data set is constructed using an association analysis algorithm and comprises multiple groups of data, each group comprising a keyword and at least one corresponding related keyword.
In this embodiment, a target associated data set may be obtained in advance, where the target associated data set may be used to represent keywords having correlation therebetween, the target associated data set may be constructed by using an association analysis algorithm, the associated data set may include multiple sets of data, and each set of data may include one keyword and at least one corresponding related keyword.
In this embodiment, the target associated data set may be used to determine whether a keyword has related keywords and, if so, which ones. The target associated data set may be stored in the form of a table, text, an image, or the like, as determined by the actual situation; this is not limited in the embodiments of this specification.
In this embodiment, the divide-and-conquer strategy of the association analysis algorithm (FP-Growth) is as follows: the database providing the frequent item sets is compressed into a frequent pattern tree (FP-tree) while the item set association information is retained. The frequent pattern tree is a special prefix tree composed of a frequent item header table and an item prefix tree, and the association analysis algorithm mines on the basis of this structure.
In this embodiment, the target associated data set may be acquired by pulling it from a preset database. The target associated data set may of course also be acquired in other possible ways, for example by querying along a preset path, as determined by the actual situation; this is not limited in the embodiments of this specification.
S103: Determine, based on the target associated data set, target related keywords corresponding to the target retrieval keyword.
In this embodiment, the target related keywords corresponding to the target retrieval keyword may be determined based on the target associated data set. There may be one or more target related keywords for a target retrieval keyword, and in some cases a target retrieval keyword may have no related keyword at all; this is determined by the actual situation and is not limited in the embodiments of this specification.
In this embodiment, the target related keywords of the target retrieval keyword may be determined from the related keywords recorded for that keyword in the target associated data set, which yields other expressions of the target retrieval keyword or words related to it. For example, a related keyword for "tomato" is another name for the tomato, and the related keywords for "Suzhou" are: Wuzhong District, Industrial Park District, Gusu District, New District, metropolitan district, Wujiang District, and so on.
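Step S103 amounts to a lookup in the associated data set. The following is a minimal sketch, assuming the data set is held in memory as a dictionary from keyword to a list of related keywords; the function name `find_related_keywords` and the sample entries are hypothetical and chosen only for illustration.

```python
def find_related_keywords(target_keywords, associated_data_set):
    """Collect related keywords for every target keyword, without duplicates."""
    related = []
    for keyword in target_keywords:
        # A keyword with no entry simply has no related keywords.
        for candidate in associated_data_set.get(keyword, []):
            if candidate not in target_keywords and candidate not in related:
                related.append(candidate)
    return related


# Assumed in-memory form of the associated data set (illustrative data).
associated_data_set = {
    "suzhou": ["wuzhong district", "gusu district", "wujiang district"],
}
```

A keyword that has no entry in the data set yields an empty list, which matches the case in which a target retrieval keyword has no related keyword.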
S104: Retrieve using the target retrieval keyword and the target related keywords to obtain a plurality of retrieval results.
In the present embodiment, in order to improve comprehensiveness and accuracy of the search, the target search keyword and the identified target related keyword may be used to perform the search at the same time, so that a plurality of search results may be obtained and displayed to the user. Compared with a mode of directly utilizing the target search keyword to search, the method and the device can search the related content which does not contain the keyword, and improve the comprehensiveness and accuracy of the search result.
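The effect of step S104 can be illustrated with a small sketch. It assumes documents are plain strings keyed by an identifier and that a match is a case-insensitive substring test; both are simplifications introduced here, not the matching scheme of the specification, and the function name `retrieve` is hypothetical.

```python
def retrieve(documents, target_keywords, target_related_keywords):
    """Return ids of documents containing any target or related keyword."""
    terms = [t.lower() for t in list(target_keywords) + list(target_related_keywords)]
    results = []
    for doc_id, text in documents.items():
        lowered = text.lower()
        if any(term in lowered for term in terms):
            results.append(doc_id)
    return results


# Illustrative corpus: d2 never mentions "Suzhou" but is about one of its districts.
documents = {
    "d1": "Hotels in Suzhou",
    "d2": "Gusu District gardens guide",
    "d3": "Beijing weather report",
}
```

With only the target keyword, `d2` is missed; adding the related keyword brings it into the result list, which is the improvement in comprehensiveness described above.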
From the above description, it can be seen that the embodiments of the present specification achieve the following technical effects: the target retrieval keyword of the retrieval content input by the user can be determined, and a target associated data set is obtained, wherein the target associated data set is constructed by using an association analysis algorithm. Because the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword, the target related keyword corresponding to the target retrieval keyword can be determined based on the target associated data set. Furthermore, the target search keyword and the target related keyword can be used for searching to obtain a plurality of search results. Therefore, the related content which does not contain the target retrieval keyword can be retrieved by utilizing the target related keyword, the view angle for observing the retrieval content is widened, the comprehensiveness of the retrieval result is effectively improved, the related retrieval result can be accurately inquired for a user, and the user experience is improved.
In one embodiment, before acquiring the target associated data set, the method may further include: determining the keywords corresponding to each content recorded in the target database, and establishing a correspondence between each content and its keywords. A target support degree may be set according to the correspondence between each content and its keywords. Using the association analysis algorithm, a frequent pattern tree of the keywords is constructed according to the target support degree, where each node in the frequent pattern tree represents a keyword. The target associated data set may then be constructed based on the frequent pattern tree.
In this embodiment, historically stored content data may be used to determine the related keywords corresponding to different keywords and thereby construct the target associated data set. The target database may be the database of a search engine, or a database that stores content data for a website or an application, as determined by the application scenario; this is not limited in the embodiments of this specification. The target database may store the contents that users retrieve. For example, if the search engine is used to retrieve documents, the target database may store various documents; if the search engine is used to search for food, the target database may store store names and store menus.
In this embodiment, the correspondence between each content and its keywords may be recorded in a form such as "content 1: keyword 1, keyword 2; content 2: keyword 1, keyword 3", or in other possible forms, as determined by the actual situation; this is not limited in the embodiments of this specification.
In this embodiment, the support degree is a parameter of the association analysis algorithm, and the support degree may be a probability that a certain keyword appears in the correspondence between the content and the keyword, and may be used to describe the importance of the keyword. If the support degree is smaller, the times of occurrence of the keywords per se can be considered to be smaller, and under the condition of larger data quantity, the operation quantity can be reduced by screening the keywords, and the performance is improved. The target support degree may be a minimum support degree, and may be the minimum importance that the association rule concerned by the user must satisfy, and only the keyword satisfying the minimum support degree may generate the association rule. The target support degree may be customized according to the corresponding relationship between each content and the keyword, and specifically may be determined according to the number of times of occurrence of each keyword, so as to prevent the target support degree from being too large or too small. The target support degree may be any decimal number greater than 0, for example: 0.2, 0.4, etc., which can be determined according to practical conditions, and the examples in this specification do not limit this.
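As a rough illustration of the support degree, the sketch below computes, for each keyword, the fraction of contents in which it appears and then keeps only the keywords that reach a minimum support. The dictionary layout and the function names are assumptions made for this example.

```python
from collections import Counter


def keyword_support(content_keywords):
    """Map each keyword to the fraction of contents in which it appears."""
    total = len(content_keywords)
    counts = Counter(
        keyword
        for keywords in content_keywords.values()
        for keyword in set(keywords)  # count a keyword once per content
    )
    return {keyword: count / total for keyword, count in counts.items()}


def filter_by_support(support, min_support):
    """Keep only keywords whose support reaches the target (minimum) support."""
    return {keyword for keyword, value in support.items() if value >= min_support}


# Assumed content-to-keyword correspondence (illustrative data).
content_keywords = {
    "content 1": ["a", "b"],
    "content 2": ["a", "c"],
    "content 3": ["a", "b"],
    "content 4": ["d"],
}
```

With a target support of 0.4, only `a` (0.75) and `b` (0.5) remain, so the rarely occurring keywords `c` and `d` are screened out before any tree is built.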
In this embodiment, elements that do not meet the target support requirement will not appear in the final frequent pattern tree. The frequent pattern tree connects nodes of the same element by links, and the connected nodes can be regarded as a linked list. After the keywords of each content are sorted by support degree, they are inserted in that order into a tree whose root node is NULL. Each node in the frequent pattern tree represents one keyword, and the occurrence count of the node can be recorded at each node.
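The tree construction described above can be sketched as follows. This is a simplified illustration of FP-tree building under stated assumptions: each content is a list of keywords, ties in support are broken alphabetically, and the header table stores the most recently created node for each keyword. The class and function names are hypothetical.

```python
from collections import Counter


class Node:
    def __init__(self, keyword, parent=None):
        self.keyword = keyword   # None for the root node
        self.count = 0           # occurrence count recorded at the node
        self.parent = parent
        self.children = {}       # keyword -> child Node
        self.link = None         # next node holding the same keyword


def build_fp_tree(transactions, min_support):
    """Build a frequent pattern tree and its header table of node links."""
    total = len(transactions)
    counts = Counter(k for t in transactions for k in set(t))
    frequent = {k: c for k, c in counts.items() if c / total >= min_support}
    root = Node(None)
    header = {}
    for transaction in transactions:
        # Drop infrequent keywords, then order by descending occurrence count.
        items = sorted(
            (k for k in set(transaction) if k in frequent),
            key=lambda k: (-frequent[k], k),
        )
        node = root
        for keyword in items:
            child = node.children.get(keyword)
            if child is None:
                child = Node(keyword, node)
                node.children[keyword] = child
                child.link = header.get(keyword)  # chain same-keyword nodes
                header[keyword] = child
            child.count += 1
            node = child
    return root, header
```

Keywords below the target support never enter the tree, and contents that share a prefix of frequent keywords share the corresponding path, which is what makes the structure compact.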
In this embodiment, the purpose of the association analysis algorithm is to find, among keywords that occur many times, the keywords or keyword sets that occur most often, where "occur most often" means that the occurrence probability is greater than or equal to a given threshold (the target support degree). Counting the occurrences of a single data item is simple and needs only a traversal count, but the number of occurrences of a combination of keywords, i.e., a keyword set, is harder to determine. For example, data item A and data item B may each occur frequently, while the combination of the two, i.e., A and B occurring at the same time, does not occur frequently. A combination of keywords that occurs frequently in the data set may be called a frequent item set, and the keywords in a frequent item set may be related keywords of one another. The association analysis algorithm can therefore mine associated keywords efficiently and accurately, providing the data basis for determining the target related keywords corresponding to the target retrieval keyword.
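The distinction between frequent single items and frequent combinations can be shown with a brute-force count. Note that this sketch deliberately enumerates combinations directly rather than using FP-Growth, so it only illustrates what a frequent item set is, not how the algorithm finds one efficiently; the function name and the `max_size` parameter are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return item sets (as sorted tuples) whose support reaches min_support."""
    total = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {
        itemset: count / total
        for itemset, count in counts.items()
        if count / total >= min_support
    }


# A and B each appear in 3 of 5 transactions, but together in only 1.
transactions = [["A"], ["A"], ["B"], ["B"], ["A", "B"]]
```

With a minimum support of 0.5, `("A",)` and `("B",)` are frequent while `("A", "B")` is not, which is exactly the situation described in the paragraph above.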
In one embodiment, determining the keywords corresponding to each content recorded in the target database may include: when the target content recorded in the target database has a corresponding keyword line, acquiring the keyword line and preprocessing it to obtain the keywords corresponding to the target content, where the preprocessing comprises splitting the keyword line by delimiter. When it is determined that the target content recorded in the target database has no corresponding keyword line, the target content itself may be acquired and preprocessed to obtain the keywords corresponding to the target content, where the preprocessing comprises word segmentation and stop-word removal.
In this embodiment, it may first be determined whether a keyword line corresponding to the content is recorded in the target database. If so, preprocessing such as splitting the keyword line into multiple keywords at the delimiters may be performed to obtain the keywords corresponding to the content. If not, the content is acquired directly and preprocessing such as word segmentation and stop-word removal is performed on it to obtain the keywords corresponding to the content.
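The two preprocessing branches can be sketched as below. The delimiter set, the stop-word list, and the use of a regular expression as a stand-in for word segmentation are all assumptions made for illustration; Chinese text in particular would need a dedicated word segmenter.

```python
import re

# Hypothetical stop-word list; a production list would be far larger.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to"}


def keywords_for_content(content, keyword_line=None, delimiters=",;"):
    """Derive keywords from the keyword line if present, else from the content."""
    if keyword_line:
        # Branch 1: split the keyword line at the delimiters.
        parts = re.split("[" + re.escape(delimiters) + "]", keyword_line)
        return [part.strip() for part in parts if part.strip()]
    # Branch 2: naive word segmentation followed by stop-word removal.
    tokens = re.findall(r"\w+", content.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```

For example, a keyword line `"big data; retrieval, FP-Growth"` yields three keywords, while a content with no keyword line falls back to its own words minus the stop words.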
In this embodiment, the keyword line may be a brief introduction to the content and may be used to represent the core idea of the content. For example, when the content is a book, the keyword line may be the book's brief introduction. The keyword line is of course not limited to the above examples; those skilled in the art may make other changes in light of the technical spirit of the embodiments of this specification, and such changes fall within the scope of the embodiments of this specification as long as the functions and effects achieved are the same as or similar to those of the embodiments.
In one embodiment, building a target associated data set based on a frequent pattern tree may include: and screening out related keywords of each keyword based on the frequent pattern tree, and establishing a corresponding relation between each keyword and the related keywords to obtain an initial associated data set. Furthermore, a related word score table can be obtained; the related word score table is used for representing the degree of correlation between any two keywords. Optimizing the initial associated data set based on the related word scoring table to obtain a target associated data set; wherein the optimization process comprises adding related keywords and deleting related keywords.
In this embodiment, the related keywords of each keyword may be screened out based on the frequent pattern tree. Suppose a keyword in the frequent pattern tree is the current node and its occurrence count is B. A floating-up count C and a floating-down count D may be set as parameters, and among the nodes before and after the current node, the nodes whose occurrence count lies in the interval [B - D, B + C] are screened out; the correspondence between the keyword and its related keywords is then established from the keywords that meet the screening requirement. For example, let the current node be m (keyword m) with occurrence count B. If, among the X1 nodes before node m, the occurrence counts of nodes m1 and m2 lie within [B - D, B + C], and among the X2 nodes after node m, the occurrence counts of nodes m3 and m4 lie within [B - D, B + C], then the correspondence established for node m is keyword m: related keyword m1, related keyword m2, related keyword m3, related keyword m4. The nodes in the frequent pattern tree can be traversed in this way to generate the correspondence between the keyword of each node and its related keywords.
In this embodiment, the occurrence count is the number of times the keyword occurs. The occurrence count B may be an integer greater than 0, and the floating-up count C and floating-down count D may each be an integer greater than 0, for example 2 or 3, as determined by the actual situation; this is not limited in the embodiments of this specification.
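The screening rule can be sketched as follows. The specification does not fix how "before" and "after" nodes are enumerated, so this example assumes the nodes are available as an ordered list of (keyword, occurrence count) pairs, for instance along one path of the tree; the function and parameter names are hypothetical.

```python
def screen_related(nodes, index, float_up_c, float_down_d, x1, x2):
    """Return keywords near nodes[index] whose count lies in [B - D, B + C]."""
    _, b = nodes[index]
    low, high = b - float_down_d, b + float_up_c
    before = nodes[max(0, index - x1):index]      # the X1 nodes before the current node
    after = nodes[index + 1:index + 1 + x2]       # the X2 nodes after the current node
    return [keyword for keyword, count in before + after if low <= count <= high]


# Illustrative path: the current node is "m" with occurrence count B = 10.
nodes = [("m1", 9), ("m2", 11), ("m", 10), ("m3", 8), ("m4", 20)]
```

With C = 2 and D = 2 the interval is [8, 12], so `m1`, `m2`, and `m3` are kept as related keywords of `m` and `m4` is rejected.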
In this embodiment, the initial associated data set may record the related keywords of each keyword, for example in the form "keyword 1: related keyword 1, related keyword 2; keyword 2: related keyword 1, related keyword 3", and may be stored as a table, text, an image, or the like. The initial associated data set is of course not limited to the above examples; those skilled in the art may make other changes in light of the technical spirit of the embodiments of this specification, and such changes fall within the scope of the embodiments of this specification as long as the functions and effects achieved are the same as or similar to those of the embodiments.
In this embodiment, because the associated data set mined by the association analysis algorithm may omit some related keywords or contain related keywords whose degree of association is not very high, it may be corrected and optimized using the related word score table. The related word score table can be constructed from historical search data and represents the degree of correlation between any two keywords; the higher the degree of correlation, the more likely the two are related keywords.
In this embodiment, the historical click rate of each content in the target search engine may be obtained, and the related word score table is constructed using the top preset number of contents ranked by click rate as the basic data. The high-frequency words in each content can be determined, and any two high-frequency words in each content are recorded as a pair of data, with an initial score of 0 for each pair. The number of times any two different high-frequency words co-occur in the same content can then be counted. If two first high-frequency words co-occur in the same content a number of times greater than or equal to a third preset threshold, the two first high-frequency words are considered related keywords, and the score of their degree of correlation in the related word score table may be increased by 1.
In this embodiment, on the premise that the number of times that two second high-frequency words co-occur in the same content is 0, the number of times that the two second high-frequency words do not co-occur in the same content may be counted. In the case that the number of times that two second high-frequency words do not co-occur in the same content is greater than or equal to a fourth preset threshold, the score of the degree of correlation of the two second high-frequency words in the related word score table may be reduced by 1. The higher the score in the related word score table is, the higher the degree of correlation between the two keywords is, and the lower the score is, the lower the degree of correlation between the two keywords is.
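The scoring rules can be sketched as below. Two points are assumptions of this example rather than statements of the specification: each content is represented by the set of its high-frequency words, and "not co-occurring" is counted as the number of contents that contain exactly one of the two words. The function name is hypothetical.

```python
from itertools import combinations


def build_score_table(contents, third_threshold, fourth_threshold):
    """Score each pair of high-frequency words as +1, -1, or 0."""
    vocabulary = sorted(set().union(*contents)) if contents else []
    scores = {}
    for a, b in combinations(vocabulary, 2):
        together = sum(1 for words in contents if a in words and b in words)
        apart = sum(1 for words in contents if (a in words) != (b in words))
        score = 0  # every pair starts at 0
        if together >= third_threshold:
            score += 1
        elif together == 0 and apart >= fourth_threshold:
            score -= 1
        scores[(a, b)] = score
    return scores


# Illustrative high-frequency words of four top-clicked contents.
contents = [{"tomato", "recipe"}, {"tomato", "recipe"}, {"loan"}, {"loan"}]
```

Here "recipe" and "tomato" co-occur twice and gain a point, while "loan" never appears with either of them and each such pair loses a point.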
In this embodiment, whether a related keyword needs to be retained may be determined from the score, in the related word score table, of the keyword and related keyword pair recorded in the initial associated data set. In some cases, when the correlation score between keyword 1 and keyword 2 in the related word score table is high but keyword 2 is not among the related keywords of keyword 1 in the initial associated data set, keyword 2 may be added to the related keywords of keyword 1. The manner of optimization is of course not limited to the above examples; those skilled in the art may make other changes in light of the technical spirit of the embodiments of this specification, and such changes fall within the scope of the embodiments of this specification as long as the functions and effects achieved are the same as or similar to those of the embodiments.
In this embodiment, the initial associated data set may be further optimized by using the related term score table, so that the accuracy of the correspondence between the keywords recorded in the associated data set and the related keywords may be effectively improved, and the accuracy of the target related keywords corresponding to the determined target search keywords may be further improved.
In one embodiment, optimizing the initial associated data set based on the related word score table to obtain the target associated data set may include: determining, based on the related word score table, the score of each related keyword corresponding to the target keyword in the initial associated data set. When a first related keyword corresponding to the target keyword exists in the related word score table and its score is less than or equal to a first preset threshold, the first related keyword may be deleted. When a second related keyword whose degree of correlation with the target keyword in the related word score table is greater than or equal to a second preset threshold does not exist in the initial associated data set, the second related keyword may be added to the related keywords corresponding to the target keyword, thereby obtaining the target associated data set.
In this embodiment, a first preset threshold and a second preset threshold may be preset, where the first preset threshold and the second preset threshold are both values greater than 0, and the first preset threshold may be equal to the second preset threshold or may not be equal to the second preset threshold, and may be determined specifically according to an actual situation, which is not limited in this embodiment of the present specification.
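The deletion and addition rules described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `optimize_associations`, the dictionary layouts, and the sample keywords are all hypothetical, and the sketch assumes that a pair absent from the score table is simply left untouched.

```python
from typing import Dict, List, Tuple

def optimize_associations(
    initial: Dict[str, List[str]],
    scores: Dict[Tuple[str, str], int],
    first_threshold: int,
    second_threshold: int,
) -> Dict[str, List[str]]:
    """Prune and extend the initial associated data set using pair scores."""
    # Build a symmetric lookup so (a, b) and (b, a) share one score.
    by_keyword: Dict[str, Dict[str, int]] = {}
    for (a, b), s in scores.items():
        by_keyword.setdefault(a, {})[b] = s
        by_keyword.setdefault(b, {})[a] = s

    target: Dict[str, List[str]] = {}
    for keyword, related in initial.items():
        row = by_keyword.get(keyword, {})
        # Delete a related keyword that is scored at or below the first threshold.
        kept = [r for r in related if r not in row or row[r] > first_threshold]
        # Add a keyword scored at or above the second threshold that the
        # initial data set did not list for this keyword.
        for other, s in row.items():
            if s >= second_threshold and other not in related:
                kept.append(other)
        target[keyword] = kept
    return target

# Hypothetical sample data.
initial = {"loan": ["credit", "weather"]}
scores = {("loan", "credit"): 5, ("weather", "loan"): 1, ("loan", "mortgage"): 4}
print(optimize_associations(initial, scores, first_threshold=1, second_threshold=3))
```

Candidates for addition are checked against the initial list rather than the pruned list, so a keyword deleted for a low score is not re-added when the two thresholds happen to be equal.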
In one embodiment, after the search is performed using the target search keyword and the target related keyword to obtain a plurality of search results, the method may further include: calculating the degree of correlation between each search result and the target search keyword and target related keyword, and arranging the search results in descending order of that degree of correlation. The search results, arranged in descending order, may then be displayed to the user.
In this embodiment, not every search result is what the user wants; that is, different search results may have different degrees of correlation with the target search keyword and the target related keyword. Therefore, the search results may be displayed in the user's interface in descending order of their degree of correlation with the target search keyword and the target related keyword, so that the user can efficiently find the search results most relevant to the target search content, which effectively improves the user experience.
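The descending-order arrangement can be sketched as below. The names `rank_results` and `hit_count` are hypothetical, and the stand-in scorer (a raw count of keyword occurrences) is an assumption used only to make the example runnable; it is not the correlation formula of this specification.

```python
from typing import Callable, List

def rank_results(results: List[str], relevance: Callable[[str], float]) -> List[str]:
    """Return the results ordered from highest to lowest relevance."""
    return sorted(results, key=relevance, reverse=True)

def hit_count(keywords: List[str]) -> Callable[[str], float]:
    """Stand-in scorer (assumption): total keyword occurrences in the text."""
    def score(text: str) -> float:
        words = text.lower().split()
        return float(sum(words.count(k) for k in keywords))
    return score

results = ["bank loan rates", "weather today", "loan loan credit"]
print(rank_results(results, hit_count(["loan", "credit"])))
```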
In one embodiment, the degree of correlation between each search result and the target search keyword and the target related keyword may be calculated according to the following formula:
Figure BDA0002928875180000091
where $y$ is the degree of correlation between a search result and the target search keyword and target related keyword; $F$ is the number of target search keywords and target related keywords that appear in the search result; $H$ is the total number of words in the search result; $i$ is an index with $1 \le i \le M$, where $M$ is the total number of target search keywords and target related keywords; $g_i$ is the number of times the $i$-th keyword occurs in the search result; and $\sum_{i=1}^{M} g_i$ is the total number of times that the target search keyword and the target related keyword appear in the search result.
In this embodiment, assume that the target search keyword and the target related keywords are m1, m2, and m3, and that in search result 1, m1 occurs 1 time, m2 occurs 2 times, and m3 occurs 0 times. Then F = 2 (m1 and m2 each count once; m3 is not counted in F because it occurs 0 times), and the total number of occurrences is g_1 + g_2 + g_3 = 1 + 2 + 0 = 3 (1 occurrence of m1, plus 2 occurrences of m2, plus 0 occurrences of m3).
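The quantities in the m1, m2, m3 example can be computed as in the sketch below. The helper name `keyword_stats` and the sample word list are hypothetical. Because the formula itself appears only as an image in this text, the sketch stops at the inputs F, the total occurrence count, and H, and does not attempt to compute y.

```python
from typing import List, Tuple

def keyword_stats(result_words: List[str], keywords: List[str]) -> Tuple[int, int, int]:
    """Return (F, G, H) for one search result.

    F: number of keywords that occur at least once
    G: total occurrences over all keywords (the sum of g_i)
    H: total number of words in the result
    """
    counts = [result_words.count(k) for k in keywords]  # g_i for each keyword
    f = sum(1 for g in counts if g > 0)
    g_total = sum(counts)
    h = len(result_words)
    return f, g_total, h

# Search result 1 from the example: m1 once, m2 twice, m3 never.
words = ["m1", "x", "m2", "y", "m2"]
print(keyword_stats(words, ["m1", "m2", "m3"]))
```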
In one embodiment, before obtaining the target associated data set, the method may further include: counting the click rate of each content in the target search engine at a preset time interval, ranking the contents in the target search engine by click rate, and taking the top preset number of contents in the ranking as target contents. The high-frequency words of each target content may then be determined to obtain the correspondence between the target contents and the high-frequency words. A related word score table is acquired, where the related word score table includes the relevancy score of any two keywords, and the related word score table is updated according to the correspondence between the target contents and the high-frequency words.
In this embodiment, the preset time interval may be one day, one week, or one month, and may be determined according to actual conditions, which is not limited in the embodiments of this specification. The click rate of each content in the target search engine within the preset time interval may be obtained. For example, when the preset time interval is one day, the click rate of each content in the target search engine from 00:00 to 24:00 of that day may be counted. Of course, in some embodiments, the click rate of each content may also be counted from all historical click data of each content in the target search engine before 24:00 of that day. This may be determined according to actual conditions, and the embodiments of this specification do not limit it.
In this embodiment, since content with a high click rate has more reference value, the top preset number of contents in the ranking may be used as the target contents. The preset number may be an integer greater than 0, for example 10, 20, or 36, and may be determined according to actual conditions, which is not limited in the embodiments of this specification. The high-frequency words of a target content may be words that occur frequently in that target content. Since the high-frequency words of the same content are likely to be related words, the related word score table may be updated according to the correspondence between the target contents and the high-frequency words. In this way, the related word score table can be continuously optimized based on user feedback, which effectively improves its accuracy.
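Selecting the target contents can be sketched as follows. The click-log layout (pairs of content identifier and timestamp), the function name `top_clicked`, and the sample data are assumptions made for illustration, and the sketch uses a raw click count within the window as the click rate.

```python
from collections import Counter
from datetime import datetime
from typing import List, Tuple

def top_clicked(
    click_log: List[Tuple[str, datetime]],
    start: datetime,
    end: datetime,
    preset_number: int,
) -> List[str]:
    """Return the most-clicked content ids within [start, end)."""
    counts = Counter(cid for cid, ts in click_log if start <= ts < end)
    return [cid for cid, _ in counts.most_common(preset_number)]

# Hypothetical one-day window.
day_start = datetime(2021, 2, 1)
day_end = datetime(2021, 2, 2)
log = (
    [("a", datetime(2021, 2, 1, 9))] * 3
    + [("b", datetime(2021, 2, 1, 10))] * 2
    + [("c", datetime(2021, 2, 1, 11))]
    + [("d", datetime(2021, 1, 31, 23))] * 5  # outside the window
)
print(top_clicked(log, day_start, day_end, preset_number=2))
```

Counting over all historical clicks instead of a single day only requires passing an earlier `start`.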
In one embodiment, updating the related word score table according to the correspondence between the target contents and the high-frequency words may include: determining, according to the correspondence between the target contents and the high-frequency words, the number of times that any two different high-frequency words co-occur in the same content; taking two first high-frequency words whose number of co-occurrences in the same content is greater than or equal to a third preset threshold as related keywords; and increasing the relevancy score of the two first high-frequency words in the related word score table by 1. Further, when it is determined that there are two second high-frequency words whose number of co-occurrences in the same content is 0, the number of times that the two second high-frequency words do not co-occur in the same content may be determined. When the number of times that the two second high-frequency words do not co-occur in the same content is greater than or equal to a fourth preset threshold, the relevancy score of the two second high-frequency words in the related word score table may be reduced by 1.
In this embodiment, a third preset threshold and a fourth preset threshold may be preset, where the third preset threshold and the fourth preset threshold may both be values greater than 0. The third preset threshold may or may not be equal to the fourth preset threshold, and both may be determined according to actual conditions, which is not limited in the embodiments of this specification.
In this embodiment, the number of times that any two different high-frequency words co-occur in the same content may be counted. If the number of times that two high-frequency words co-occur in the same content is greater than or equal to the third preset threshold, the two words are treated as first high-frequency words and considered related keywords, and their relevancy score in the related word score table may be increased by 1.
In this embodiment, on the premise that the number of times that two second high-frequency words co-occur in the same content is 0, the number of times that the two second high-frequency words do not co-occur in the same content may be counted. For example, suppose the target contents are: target content 1, with high-frequency word 1 and high-frequency word 2; and target content 2, with high-frequency word 1 and high-frequency word 3. High-frequency word 1 and high-frequency word 2 co-occur, so their non-co-occurrence count is not computed; high-frequency word 1 and high-frequency word 3 co-occur, so their non-co-occurrence count is not computed; high-frequency word 2 and high-frequency word 3 have a co-occurrence count of 0, so the number of times they do not co-occur in the same content is counted, giving 2. The higher the score in the related word score table, the higher the degree of correlation between the two keywords; the lower the score, the lower the degree of correlation.
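The score-table update can be sketched as below. The function name `update_score_table` and the data layout are hypothetical. The text does not define exactly how non-co-occurrences are counted, so this sketch assumes a non-co-occurrence is a content that contains exactly one of the two words, which reproduces the count of 2 in the two-content example above.

```python
from collections import defaultdict
from itertools import combinations
from typing import DefaultDict, Dict, Set, Tuple

def update_score_table(
    scores: DefaultDict[Tuple[str, str], int],
    content_words: Dict[str, Set[str]],
    third_threshold: int,
    fourth_threshold: int,
) -> DefaultDict[Tuple[str, str], int]:
    """Adjust pair scores from the content-to-high-frequency-word mapping."""
    vocabulary = sorted(set().union(*content_words.values()))
    for a, b in combinations(vocabulary, 2):
        word_sets = content_words.values()
        co = sum(1 for ws in word_sets if a in ws and b in ws)
        if co >= third_threshold:
            scores[(a, b)] += 1  # frequently co-occurring: more related
        elif co == 0:
            # Assumption: count contents holding exactly one of the two words.
            non_co = sum(1 for ws in word_sets if (a in ws) != (b in ws))
            if non_co >= fourth_threshold:
                scores[(a, b)] -= 1  # never together: less related
    return scores

contents = {"content1": {"w1", "w2"}, "content2": {"w1", "w3"}}
table = update_score_table(defaultdict(int), contents, 1, 2)
print(dict(table))
```

Pairs are keyed in sorted order, so each unordered pair of words has exactly one entry in the table.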
In the embodiment, whether any two high-frequency words are related keywords can be determined by counting the frequency of the co-occurrence of any two different high-frequency words in the same content, so that the related word scoring table is further optimized continuously, and the accuracy of the related word scoring table is effectively improved.
In one embodiment, determining the high-frequency words of each target content may include: preprocessing each target content to obtain the plurality of words contained in each target content, where the preprocessing includes word segmentation and stop-word removal. Words whose frequency of occurrence in the same target content is greater than or equal to a fifth preset threshold may then be taken as the high-frequency words of that target content, so as to obtain the high-frequency words of each target content.
In this embodiment, each target content may be split into a plurality of words by a word segmentation tool, and the resulting words may be filtered by removing stop words, so that the plurality of words contained in each target content is obtained. The fifth preset threshold may be a value greater than 0, for example 2 or 3, and may be determined according to actual conditions, which is not limited in the embodiments of this specification. Words whose frequency of occurrence in the same target content is greater than or equal to the fifth preset threshold may be regarded as the high-frequency words of that target content, so that the correspondence between the target contents and the high-frequency words can be generated efficiently and accurately.
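A minimal sketch of high-frequency word extraction follows. Whitespace splitting stands in for a real word segmentation tool, and the stop-word list, the function name `high_frequency_words`, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
from typing import FrozenSet, Set

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "a", "of", "and", "to"})

def high_frequency_words(
    text: str,
    fifth_threshold: int,
    stop_words: FrozenSet[str] = STOP_WORDS,
) -> Set[str]:
    """Return words occurring at least fifth_threshold times in one content."""
    # Segmentation (stand-in: split on whitespace), then stop-word removal.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= fifth_threshold}

print(high_frequency_words("the loan rate and the loan term of a loan rate", 2))
```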
Based on the same inventive concept, the embodiment of the present specification further provides a related search device, such as the following embodiments. Because the principle of the relevant retrieval device for solving the problem is similar to the relevant retrieval method, the implementation of the relevant retrieval device can refer to the implementation of the relevant retrieval method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated. Fig. 2 is a block diagram of a structure of a related search device according to an embodiment of the present disclosure, and as shown in fig. 2, the related search device may include: the first determining module 201, the obtaining module 202, the second determining module 203, and the retrieving module 204 are described below.
A first determining module 201, configured to determine a target search keyword of search content input by a user;
an obtaining module 202, which may be configured to obtain a target associated data set; the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword;
a second determining module 203, configured to determine, based on the target associated data set, a target related keyword corresponding to the target search keyword;
the search module 204 may be configured to perform a search using the target search keyword and the target related keyword to obtain a plurality of search results.
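The four modules can be pictured as methods of one class, as in the sketch below. The class name `RelatedSearchDevice`, the in-memory document list, the vocabulary-based keyword extraction, and the match-any-term retrieval are all simplifying assumptions, not the structure of Fig. 2 itself.

```python
from typing import Dict, List, Set

class RelatedSearchDevice:
    """Toy counterpart of modules 201 to 204 over in-memory data."""

    def __init__(self, data_set: Dict[str, List[str]],
                 documents: List[str], vocabulary: Set[str]) -> None:
        self.data_set = data_set
        self.documents = documents
        self.vocabulary = vocabulary

    def determine_target_keywords(self, query: str) -> List[str]:
        # First determining module (201): keep query words that are known keywords.
        return [w for w in query.lower().split() if w in self.vocabulary]

    def obtain_data_set(self) -> Dict[str, List[str]]:
        # Obtaining module (202): return the prebuilt associated data set.
        return self.data_set

    def determine_related_keywords(self, keywords: List[str],
                                   data_set: Dict[str, List[str]]) -> List[str]:
        # Second determining module (203): look up related keywords.
        related: List[str] = []
        for k in keywords:
            for r in data_set.get(k, []):
                if r not in keywords and r not in related:
                    related.append(r)
        return related

    def retrieve(self, keywords: List[str], related: List[str]) -> List[str]:
        # Retrieval module (204): match documents containing any term.
        terms = set(keywords) | set(related)
        return [d for d in self.documents if terms & set(d.lower().split())]

    def search(self, query: str) -> List[str]:
        keywords = self.determine_target_keywords(query)
        related = self.determine_related_keywords(keywords, self.obtain_data_set())
        return self.retrieve(keywords, related)

device = RelatedSearchDevice(
    {"loan": ["credit"]},
    ["credit card offers", "loan calculator", "weather report"],
    {"loan", "credit"},
)
print(device.search("personal loan"))
```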
The embodiments of the present specification further provide an electronic device based on the related retrieval method provided by the embodiments of the present specification. The electronic device may specifically include an input device 31, a processor 32, and a memory 33. The input device 31 may be specifically used to input search content. The processor 32 may be specifically configured to: determine a target search keyword of the search content input by the user; acquire a target associated data set, where the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword; determine target related keywords corresponding to the target search keyword based on the target associated data set; and search by using the target search keyword and the target related keyword to obtain a plurality of search results. The memory 33 may be specifically used for storing parameters such as the plurality of search results.
In this embodiment, the input device may be one of the main apparatuses for information exchange between a user and a computer system. The input devices may include a keyboard, mouse, camera, scanner, light pen, handwriting input panel, voice input device, etc.; the input device is used to input raw data and a program for processing the data into the computer. The input device can also acquire and receive data transmitted by other modules, units and devices. The processor may be implemented in any suitable way. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth. The memory may in particular be a memory device used in modern information technology for storing information. The memory may include multiple levels, and in a digital system, memory may be used as long as binary data can be stored; in an integrated circuit, a circuit without a physical form and with a storage function is also called a memory, such as a RAM, a FIFO and the like; in the system, the storage device in physical form is also called a memory, such as a memory bank, a TF card and the like.
In this embodiment, the functions and effects specifically realized by the electronic device can be explained by comparing with other embodiments, and are not described herein again.
Embodiments of the present specification further provide a computer storage medium based on the related retrieval method, where the computer storage medium stores computer program instructions that, when executed, implement: determining a target search keyword of search content input by a user; acquiring a target associated data set, where the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword; determining target related keywords corresponding to the target search keyword based on the target associated data set; and searching by using the target search keyword and the target related keyword to obtain a plurality of search results.
In this embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer storage medium can be explained by comparing with other embodiments, and are not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present specification described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed over a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different from that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, embodiments of the present description are not limited to any specific combination of hardware and software.
Although the embodiments herein provide the method steps as described in the above embodiments or flowcharts, more or fewer steps may be included in the method based on conventional or non-inventive efforts. In the case of steps where no causal relationship is logically necessary, the order of execution of the steps is not limited to that provided by the embodiments of the present description. When the method is executed in an actual device or end product, the method can be executed sequentially or in parallel according to the embodiment or the method shown in the figure (for example, in the environment of a parallel processor or a multi-thread processing).
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of embodiments of the present specification should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description is only a preferred embodiment of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure, and it will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present disclosure should be included in the protection scope of the embodiments of the present disclosure.

Claims (13)

1. A correlation retrieval method, comprising:
determining a target retrieval keyword of retrieval content input by a user;
acquiring a target associated data set; the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword;
determining target related keywords corresponding to the target retrieval keywords based on the target associated data set;
and searching by using the target search keyword and the target related keyword to obtain a plurality of search results.
2. The method of claim 1, prior to obtaining the target associated dataset, further comprising:
determining keywords corresponding to each content recorded in a target database;
establishing a corresponding relation between each content and the keyword;
setting target support degrees according to the corresponding relation between each content and the keywords;
constructing a frequent pattern tree of each keyword according to the target support degree by using an association analysis algorithm; each node in the frequent pattern tree represents a keyword;
constructing the target associated dataset based on the frequent pattern tree.
3. The method of claim 2, wherein constructing the target association dataset based on the frequent pattern tree comprises:
screening out related keywords of each keyword based on the frequent pattern tree;
establishing a corresponding relation between each keyword and related keywords to obtain an initial associated data set;
acquiring a related word scoring table; the related word scoring table is used for representing the degree of correlation between any two keywords;
optimizing the initial associated data set based on the related word scoring table to obtain a target associated data set; wherein the optimization process comprises adding related keywords and deleting related keywords.
4. The method of claim 2, wherein determining keywords corresponding to each content recorded in the target database comprises:
under the condition that the target content recorded in the target database has the corresponding keyword row, acquiring the keyword row corresponding to the target content;
preprocessing the keyword row corresponding to the target content to obtain a keyword corresponding to the target content; wherein the pre-processing comprises: splitting the keyword row into a plurality of keywords according to the separators;
under the condition that the target content recorded in the target database is determined to have no corresponding keyword row, acquiring the target content;
preprocessing the target content to obtain a keyword corresponding to the target content; wherein the pre-processing comprises: word segmentation and stop-word removal.
5. The method of claim 3, wherein optimizing the initial associated dataset based on the related term score table to obtain a target associated dataset comprises:
determining the score of each related keyword corresponding to the target keyword in the initial associated data set based on the related word score table;
deleting a first related keyword corresponding to a target keyword under the condition that the score of the first related keyword is smaller than or equal to a first preset threshold value;
and under the condition that a second related keyword whose degree of correlation with the target keyword in the related word score table is greater than or equal to a second preset threshold does not exist in the initial associated data set, adding the second related keyword to the related keywords corresponding to the target keyword to obtain the target associated data set.
6. The method according to claim 1, wherein after the searching using the target search keyword and the target related keyword to obtain a plurality of search results, further comprising:
calculating the correlation degree of each retrieval result with the target retrieval key words and the target related key words;
according to the correlation degree of each retrieval result with the target retrieval key words and the target related key words, performing descending order arrangement on each retrieval result;
and displaying each retrieval result after the descending order to the user.
7. The method according to claim 6, wherein the degree of correlation between each search result and the target search keyword and the target related keyword is calculated according to the following formula:
Figure FDA0002928875170000021
wherein $y$ is the degree of correlation between a search result and the target search keyword and target related keyword; $F$ is the number of the target search keywords and the target related keywords appearing in one search result; $H$ is the total number of words of the search result; $i$ is a variable, $1 \le i \le M$, and $M$ is the total number of the target search keywords and the target related keywords; $g_i$ is the number of times the $i$-th keyword occurs in one search result; and $\sum_{i=1}^{M} g_i$ is the total number of times that the target search keyword and the target related keyword appear in one search result.
8. The method of claim 1, wherein prior to obtaining the target associated dataset, further comprising:
counting the click rate of each content in the target search engine according to a preset time interval;
ranking all contents in the target search engine according to the click rate, and taking the top preset number of contents in the ranking as target contents;
determining high-frequency words of each target content to obtain the corresponding relation between the target content and the high-frequency words;
acquiring a related word scoring table; the related word score table comprises scores of the relevancy of any two keywords;
and updating the related word score table according to the corresponding relation between the target content and the high-frequency words.
9. The method according to claim 8, wherein updating the related word score table according to the correspondence between the target content and the high-frequency word comprises:
determining the frequency of the co-occurrence of any two different high-frequency words in the same content according to the corresponding relation between the target content and the high-frequency words;
taking two first high-frequency words which are co-occurring in the same content and have the frequency more than or equal to a third preset threshold value as related keywords, and adding 1 to the scores of the relevancy of the two first high-frequency words in the related word score table;
under the condition that two second high-frequency words with the frequency of co-occurrence in the same content being 0 exist, determining the frequency of non-co-occurrence of the two second high-frequency words in the same content;
and under the condition that the frequency of the two second high-frequency words not co-occurring in the same content is greater than or equal to a fourth preset threshold value, subtracting 1 from the score of the relevancy of the two second high-frequency words in the related word score table.
10. The method of claim 8, wherein determining high frequency words for each target content comprises:
preprocessing each target content to obtain a plurality of words contained in each target content; wherein the pre-processing comprises: word segmentation and stop-word removal;
and taking words with the frequency more than or equal to a fifth preset threshold value in the same target content as high-frequency words of the target content to obtain the high-frequency words of each target content.
11. A correlation retrieval apparatus, comprising:
the first determining module is used for determining a target retrieval keyword of retrieval content input by a user;
the acquisition module is used for acquiring a target associated data set; the target associated data set is constructed by using an association analysis algorithm, the associated data set comprises a plurality of groups of data, and each group of data comprises a keyword and at least one corresponding related keyword;
a second determining module, configured to determine, based on the target associated data set, a target related keyword corresponding to the target search keyword;
and the retrieval module is used for retrieving by utilizing the target retrieval keywords and the target related keywords to obtain a plurality of retrieval results.
12. A correlation retrieval device comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 10.
13. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 10.
CN202110141804.3A 2021-02-02 2021-02-02 Correlation retrieval method, device and equipment Pending CN112835923A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110141804.3A CN112835923A (en) 2021-02-02 2021-02-02 Correlation retrieval method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110141804.3A CN112835923A (en) 2021-02-02 2021-02-02 Correlation retrieval method, device and equipment

Publications (1)

Publication Number Publication Date
CN112835923A true CN112835923A (en) 2021-05-25

Family

ID=75931594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110141804.3A Pending CN112835923A (en) 2021-02-02 2021-02-02 Correlation retrieval method, device and equipment

Country Status (1)

Country Link
CN (1) CN112835923A (en)

Similar Documents

Publication Publication Date Title
Horng et al. Applying genetic algorithms to query optimization in document retrieval
JP4881322B2 (en) Information retrieval system based on multiple indexes
EP1622053B1 (en) Phrase identification in an information retrieval system
CA2813644C (en) Phrase-based searching in an information retrieval system
US7797265B2 (en) Document clustering that applies a locality sensitive hashing function to a feature vector to obtain a limited set of candidate clusters
US10579661B2 (en) System and method for machine learning and classifying data
EP1622055B1 (en) Phrase-based indexing in an information retrieval system
US20170161375A1 (en) Clustering documents based on textual content
US7702618B1 (en) Information retrieval system for archiving multiple document versions
US7711668B2 (en) Online document clustering using TFIDF and predefined time windows
US7580929B2 (en) Phrase-based personalization of searches in an information retrieval system
US7644047B2 (en) Semantic similarity based document retrieval
US10430448B2 (en) Computer-implemented method of and system for searching an inverted index having a plurality of posting lists
EP1622052A1 (en) Phrase-based generation of document description
US20080147642A1 (en) System for discovering data artifacts in an on-line data object
WO2008144457A2 (en) Efficient retrieval algorithm by query term discrimination
CN108509490B (en) Network hot topic discovery method and system
CN112835923A (en) Correlation retrieval method, device and equipment
CN110889023A (en) Distributed multifunctional search engine of elastic search
WO2015094889A2 (en) Trending analysis for streams of documents
CN117609318A (en) Score sorting optimization method, device, equipment and storage medium
Tourad et al. A novel indexing algorithm for content-based Publish/Subscribe systems in a Big Data environment
Sridharan et al. RELEVANT UPDATED DATA RETRIEVAL ARCHITECTURAL MODEL FOR CONTINUOUS TEXT EXTRACTION
AU2006246519A1 (en) Method, System and Software Product for Locating Documents of Interest

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination