CN112765966A - Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment


Info

Publication number: CN112765966A
Authority: CN (China)
Prior art keywords: word, words, semantic, association, obtaining
Legal status: Granted (Active)
Application number: CN202110368415.4A
Other languages: Chinese (zh)
Other versions: CN112765966B
Inventor: 刘艾婷
Current and original assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110368415.4A
Publication of CN112765966A; application granted; publication of CN112765966B

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/237: Lexical tools
    • G06F 40/247: Thesauruses; synonyms
    • G06F 40/30: Semantic analysis
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines
    • G06F 16/9532: Query formulation


Abstract

The disclosure provides an associative word deduplication method and apparatus, a computer-readable storage medium, and an electronic device. The method comprises the following steps: obtaining semantic association feature representation vectors between different candidate associative words; processing these vectors with a first classification model to obtain a first semantic repetition index between the different candidate associative words; obtaining historical search behavior overlap feature representation vectors between the different candidate associative words, where these vectors represent the degree to which search behavior overlaps between the candidate associative words; processing the historical search behavior overlap feature representation vectors with a second classification model to obtain a second semantic repetition index between the different candidate associative words; and, according to the first and second semantic repetition indices, performing deduplication filtering on the candidate associative words with repeated semantics, thereby determining the target associative words.

Description

Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for removing duplicate of an associative word, a computer-readable storage medium, and an electronic device.
Background
With the development of internet applications, more and more users trigger search operations by entering keywords on search pages, or enter keywords in the input field of an input method. In such application scenarios, to improve the user experience, related technologies may infer the user's intention through various technical means and list corresponding associative words according to the keywords the user enters. However, the related technologies cannot effectively filter out associative words whose semantics repeat one another, so the associative words finally displayed to the user are not rich enough.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure.
Disclosure of Invention
The embodiments of the present disclosure provide an associative word deduplication method and apparatus, a computer-readable storage medium, and an electronic device, which can eliminate candidate associative words with duplicate semantics according to both the degree of semantic repetition and the degree of search behavior overlap between different candidate associative words, so that richer target associative words can be displayed.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
The embodiments of the present disclosure provide an associative word deduplication method, which comprises the following steps: performing semantic recall on a query keyword to obtain candidate associative words of the query keyword; obtaining semantic association feature representation vectors between different candidate associative words, where the semantic association feature representation vectors represent the degree of semantic repetition between the different candidate associative words; processing the semantic association feature representation vectors with a first classification model to obtain a first semantic repetition index between the different candidate associative words; obtaining historical search behavior overlap feature representation vectors between the different candidate associative words, where these vectors represent the degree of search behavior overlap between the different candidate associative words; processing the historical search behavior overlap feature representation vectors with a second classification model to obtain a second semantic repetition index between the different candidate associative words; and, according to the first semantic repetition index and the second semantic repetition index, performing deduplication filtering on the candidate associative words with repeated semantics, determining the target associative words, and displaying the query keyword and the target associative words together.
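As a minimal illustration of the flow described above, the deduplication step might be sketched as follows. The feature extractors and classifiers here are hypothetical stand-in callables, not the patent's actual models, and the greedy keep-the-first policy is only one possible way to "select one of a duplicate pair":

```python
from itertools import combinations

def dedup_candidates(candidates, semantic_features, behavior_features,
                     clf1, clf2, threshold1=0.5, threshold2=0.5):
    """Drop a candidate whenever either semantic repetition index for a
    pair exceeds its threshold; keep the earlier-ranked word of the pair."""
    kept = list(candidates)
    for a, b in combinations(candidates, 2):
        if a not in kept or b not in kept:
            continue  # one of the pair was already filtered out
        idx1 = clf1(semantic_features(a, b))   # first semantic repetition index
        idx2 = clf2(behavior_features(a, b))   # second semantic repetition index
        if idx1 > threshold1 or idx2 > threshold2:
            kept.remove(b)                     # semantic repetition: keep only one
    return kept
```

With stub extractors that flag one pair as repeated, only one of that pair survives while unrelated candidates are untouched.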
The embodiments of the present disclosure provide an associative word deduplication apparatus, comprising: a candidate associative word obtaining unit, configured to perform semantic recall on a query keyword to obtain candidate associative words of the query keyword; a semantic association feature vector obtaining unit, configured to obtain semantic association feature representation vectors between different candidate associative words, the vectors representing the degree of semantic repetition between the different candidate associative words; a first semantic repetition index obtaining unit, configured to process the semantic association feature representation vectors with a first classification model to obtain a first semantic repetition index between the different candidate associative words; a search behavior overlap feature obtaining unit, configured to obtain historical search behavior overlap feature representation vectors between the different candidate associative words, the vectors representing the degree of search behavior overlap between the different candidate associative words; a second semantic repetition index obtaining unit, configured to process the historical search behavior overlap feature representation vectors with a second classification model to obtain a second semantic repetition index between the different candidate associative words; and a candidate associative word deduplication filtering unit, configured to perform deduplication filtering on the candidate associative words with repeated semantics according to the first and second semantic repetition indices, determine the target associative words, and display the query keyword and the target associative words together.
In some exemplary embodiments of the present disclosure, the candidate associative words include a first associative word and a second associative word, and the semantic association feature vector obtaining unit comprises: a distance information obtaining unit, configured to obtain distance information between the first and second associative words; a common character information obtaining unit, configured to obtain common character information between the first and second associative words; a character string length information obtaining unit, configured to obtain character string length information between the first and second associative words; and a semantic association feature representation vector generation unit, configured to generate the semantic association feature representation vector between the first and second associative words from the distance information, the common character information, and the character string length information.
In some exemplary embodiments of the present disclosure, the distance information between the first and second associative words includes a semantic distance between them, and the distance information obtaining unit comprises: a character string length obtaining unit, configured to obtain a first character string length of the first associative word and a second character string length of the second associative word; a semantic similarity obtaining unit, configured to obtain the semantic similarity between the first and second associative words; a character string common prefix length obtaining unit, configured to obtain the length of the common prefix of the two character strings; a similarity weight obtaining unit, configured to determine a similarity weight for the common prefix length; an edit similarity obtaining unit, configured to obtain the edit similarity between the first and second associative words from the semantic similarity, the common prefix length, and the similarity weight of the common prefix length; and a semantic distance generating unit, configured to obtain the semantic distance between the first and second associative words from the edit similarity.
In some exemplary embodiments of the present disclosure, the semantic similarity obtaining unit comprises: a matching character number obtaining unit, configured to obtain the number of matching characters between the first and second associative words; a character transposition number obtaining unit, configured to obtain the number of character transpositions between the first and second associative words; and a semantic similarity generating unit, configured to obtain the semantic similarity between the first and second associative words from the first character string length, the second character string length, the number of matching characters, and the number of character transpositions.
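The combination described in the two paragraphs above, namely matching characters, character transpositions, a common prefix length, and a prefix similarity weight, closely resembles the classic Jaro-Winkler similarity. A sketch under that assumption (this is the standard algorithm, not necessarily the patent's exact formula; the semantic distance can then be taken as one minus the similarity):

```python
def jaro(s1, s2):
    """Jaro similarity: built from matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # characters match only within this window
    flags1, flags2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not flags2[j] and s2[j] == c:
                flags1[i] = flags2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):          # count matched characters out of order
        if flags1[i]:
            while not flags2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro score by the common prefix length times a weight p."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

The prefix weight `p` plays the role of the "similarity weight of the common prefix length" above; 0.1 is the conventional default.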
In some exemplary embodiments of the present disclosure, the distance information between the first and second associative words includes an edit distance between them, and the distance information obtaining unit comprises: a minimum editing operation number obtaining unit, configured to obtain the minimum number of editing operations required to convert one of the first and second associative words into the other; and an edit distance generating unit, configured to obtain the edit distance between the first and second associative words from the minimum number of editing operations.
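The minimum number of editing operations described here is the standard Levenshtein edit distance, computable with dynamic programming; a minimal sketch:

```python
def levenshtein(s1, s2):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s1 into s2 (row-by-row DP)."""
    prev = list(range(len(s2) + 1))        # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution (0 if equal)
        prev = curr
    return prev[-1]
```

For example, "kitten" becomes "sitting" in three operations, matching the classic textbook case.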
In some exemplary embodiments of the present disclosure, the distance information between the first and second associative words includes a similar distance between them, and the distance information obtaining unit comprises: a word set obtaining unit, configured to obtain a first word set of the first associative word and a second word set of the second associative word, where the first word set contains the non-repeating words of the first associative word and the second word set contains the non-repeating words of the second associative word; an intersection element number obtaining unit, configured to obtain the number of elements in the intersection of the first and second word sets; a union element number obtaining unit, configured to obtain the number of elements in the union of the first and second word sets; a similarity coefficient obtaining unit, configured to obtain the similarity coefficient between the first and second associative words from the intersection and union element numbers; and a similar distance generating unit, configured to obtain the similar distance between the first and second associative words from the similarity coefficient.
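A similarity coefficient built from intersection and union element counts is the Jaccard coefficient. One plausible reading (token sets are supplied by the caller, since the segmentation of an associative word into words is not fixed above, and taking the similar distance as the complement of the coefficient is an assumption):

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Similarity coefficient = |A ∩ B| / |A ∪ B| over the sets of
    non-repeating tokens of the two associative words."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

def similar_distance(tokens_a, tokens_b):
    """Hypothetical: the similar distance as the coefficient's complement."""
    return 1.0 - jaccard_similarity(tokens_a, tokens_b)
```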
In some exemplary embodiments of the present disclosure, the common character information between the first and second associative words includes at least one of the longest common prefix length of the two character strings, the common word proportion, the union of the two word sets, and the intersection of the two word sets.
In some exemplary embodiments of the present disclosure, when the common character information between the first and second associative words includes the common word proportion between them, the common character information obtaining unit comprises: a word sequence obtaining unit, configured to obtain a first word sequence of the first associative word and a second word sequence of the second associative word; a first common word obtaining unit, configured to obtain the first common words, namely the characters of the first word sequence that also appear in the second word sequence; a second common word obtaining unit, configured to obtain the second common words, namely the characters of the second word sequence that also appear in the first word sequence; a common word length obtaining unit, configured to obtain the common word length from the lengths of the first and second common words; a word sequence length obtaining unit, configured to obtain the word sequence length from the lengths of the first and second word sequences; and a common word proportion obtaining unit, configured to obtain the common word proportion between the first and second associative words from the common word length and the word sequence length.
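Read literally, the steps above collect, for each word, the characters it shares with the other, and divide the total shared length by the total sequence length. A sketch of that reading (summing the two lengths in both numerator and denominator is an assumption about how the lengths are combined):

```python
def common_word_proportion(seq_a, seq_b):
    """Shared-character proportion between two word/character sequences."""
    common_a = [c for c in seq_a if c in seq_b]  # chars of A found in B
    common_b = [c for c in seq_b if c in seq_a]  # chars of B found in A
    total = len(seq_a) + len(seq_b)
    if total == 0:
        return 0.0
    return (len(common_a) + len(common_b)) / total
```

Identical sequences score 1.0; sequences with no characters in common score 0.0.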
In some exemplary embodiments of the present disclosure, the character string length information between the first and second associative words includes at least one of a word set length difference between a first word set of the first associative word and a second word set of the second associative word, a word set length ratio between the first word set and the second word set, a character string length difference between the first and second associative words, and a character string length ratio between the first and second associative words.
In some exemplary embodiments of the present disclosure, the candidate associative words include a first associative word and a second associative word, and the search behavior overlap feature obtaining unit comprises: a first click-exposure information obtaining unit, configured to obtain, for the first associative word used as a search keyword, the first historical clicked web-page links and their first historical click counts, and the first historical exposed web-page links and their first historical exposure counts, within a predetermined time period; a second click-exposure information obtaining unit, configured to obtain, for the second associative word used as a search keyword, the second historical clicked web-page links and their second historical click counts, and the second historical exposed web-page links and their second historical exposure counts, within the same predetermined time period; a click-exposure overlap degree obtaining unit, configured to obtain, from these links and counts, the clicked-link overlap degree and the link-click overlap degree between the first and second historical clicked web-page links, and the exposed-link overlap degree and the link-exposure overlap degree between the first and second historical exposed web-page links; and a search behavior overlap feature generation unit, configured to generate the historical search behavior overlap feature representation vector between the candidate associative words from these four overlap degrees.
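A hypothetical sketch of the four overlap degrees described above. The exact weighting is not specified in the disclosure; here the link overlap is taken as the Jaccard overlap of the link sets, and the count overlap as the share of clicks or exposures that fall on shared links:

```python
def behavior_overlap_features(clicks_a, clicks_b, exposures_a, exposures_b):
    """Each argument maps a web-page link to its click/exposure count
    for one associative word used as a search keyword."""
    def link_overlap(counts_a, counts_b):
        links_a, links_b = set(counts_a), set(counts_b)
        union = links_a | links_b
        shared_links = links_a & links_b
        # link-level overlap: fraction of distinct links that are shared
        link_deg = len(shared_links) / len(union) if union else 0.0
        # count-level overlap: fraction of total counts landing on shared links
        total = sum(counts_a.values()) + sum(counts_b.values())
        shared = sum(counts_a[l] + counts_b[l] for l in shared_links)
        count_deg = shared / total if total else 0.0
        return link_deg, count_deg

    click_link_deg, click_count_deg = link_overlap(clicks_a, clicks_b)
    exp_link_deg, exp_count_deg = link_overlap(exposures_a, exposures_b)
    return [click_link_deg, click_count_deg, exp_link_deg, exp_count_deg]
```

The four-element list plays the role of the historical search behavior overlap feature representation vector fed to the second classification model.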
In some exemplary embodiments of the present disclosure, the candidate associative words include a first associative word and a second associative word, and the candidate associative word deduplication filtering unit comprises a union unit or a weighted summation unit. The union unit is configured to determine that semantic repetition exists between the first and second associative words if the first semantic repetition index between them is greater than a first threshold or the second semantic repetition index between them is greater than a second threshold, and to select the first or the second associative word as the target associative word. The weighted summation unit is configured to determine a target semantic repetition index between the first and second associative words from the first and second semantic repetition indices; if the target semantic repetition index is greater than a target threshold, it determines that semantic repetition exists between the first and second associative words and selects the first or the second associative word as the target associative word.
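The two alternatives, the union rule and the weighted-sum rule, can be sketched as follows; all thresholds and weights here are illustrative placeholders, not values from the disclosure:

```python
def is_duplicate_union(idx1, idx2, t1=0.5, t2=0.5):
    """Union rule: repetition if either index exceeds its own threshold."""
    return idx1 > t1 or idx2 > t2

def is_duplicate_weighted(idx1, idx2, w1=0.6, w2=0.4, target=0.5):
    """Weighted-sum rule: combine the two indices into a target semantic
    repetition index, then compare with a single target threshold."""
    return w1 * idx1 + w2 * idx2 > target
```

Under either rule, when a pair is judged to repeat, only one word of the pair is kept as the target associative word.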
The disclosed embodiments provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the associative word de-duplication method as described in the above embodiments.
An embodiment of the present disclosure provides an electronic device, comprising: at least one processor; and a storage device configured to store at least one program that, when executed by the at least one processor, causes the at least one processor to implement the associative word deduplication method described in the above embodiments.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the embodiments described above.
In the technical solutions provided by some embodiments of the present disclosure, on the one hand, the degree of semantic matching between different candidate associative words is obtained by computing the semantic association feature representation vectors between them, which determines the first semantic repetition index. On the other hand, the historical search behavior overlap feature representation vectors between the candidate associative words are computed from their historical search behavior, which yields the second semantic repetition index. The first and second semantic repetition indices are then combined to decide whether semantic repetition exists between different candidate associative words, and the semantically repeated candidates are filtered out. By jointly considering the degree of semantic repetition and the overlap of historical search behavior, the method can effectively judge whether different candidate associative words repeat each other. When applied to real search scenarios or to applications such as input methods, it deduplicates the candidate associative words, displays target associative words that are free of semantic repetition, and recommends more diverse target associative words to the user, helping the user find the desired information more quickly.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
Fig. 1 is a schematic diagram of an implementation environment of a method for removing duplicate association words according to an embodiment of the present disclosure.
Fig. 2 schematically illustrates a flow chart of an associative word deduplication method according to an embodiment of the present disclosure.
Fig. 3 schematically shows a flowchart of step S220 in fig. 2 in an exemplary embodiment.
Fig. 4 schematically shows a schematic diagram of obtaining a first semantic repetition index according to an embodiment of the present disclosure.
Fig. 5 schematically shows a flowchart of step S240 in fig. 2 in an exemplary embodiment.
Fig. 6 schematically shows a diagram of obtaining a second semantic repetition index according to an embodiment of the present disclosure.
Fig. 7 schematically illustrates an interface diagram of an associative word deduplication method according to an embodiment of the present disclosure.
Fig. 8 schematically illustrates a block diagram of an associative word deduplication apparatus according to an embodiment of the present disclosure.
Fig. 9 shows a schematic structural diagram of an electronic device suitable for implementing embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The drawings are merely schematic illustrations of the present disclosure, in which the same reference numerals denote the same or similar parts, and thus, a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in at least one hardware module or integrated circuit, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and steps, nor do they necessarily have to be performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In this specification, the terms "a", "an", "the", "said" and "at least one" are used to indicate the presence of at least one element/component/etc.; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first," "second," and "third," etc. are used merely as labels, and are not limiting on the number of their objects.
Based on the technical problems in the related art, the embodiments of the present disclosure provide an associative word deduplication method for at least partially solving the above problems. The method provided by the embodiments of the present disclosure may be executed by any electronic device, for example, a server, or a terminal, or an interaction between a server and a terminal, which is not limited in the present disclosure.
An embodiment of the present application provides an associative word deduplication method. Please refer to Fig. 1, which illustrates a schematic diagram of an implementation environment of the method. The implementation environment may include a terminal 11 and a server 12.
Both the terminal 11 and the server 12 can implement the associative word deduplication method of the present application. A user may input a query keyword through the terminal 11, and the terminal 11 may send the query keyword to the server 12. After obtaining the query keyword, the server 12 performs semantic recall on it to obtain the corresponding candidate associative words. The server 12 then obtains the semantic association feature representation vectors between different candidate associative words and processes them with the first classification model to obtain the first semantic repetition index between the candidate associative words. The server 12 also obtains the historical search behavior overlap feature representation vectors between the candidate associative words, which represent the degree of search behavior overlap between them, and processes these vectors with the second classification model to obtain the second semantic repetition index. According to the first and second semantic repetition indices, the server 12 deduplicates and filters the semantically repeated candidate associative words, determines the target associative words, and returns them to the terminal 11, which displays the target associative words and the query keyword together.
Alternatively, after acquiring the input query keyword, the terminal 11 itself may obtain candidate association words of the query keyword and compute semantic association feature representation vectors between different candidate association words. The terminal 11 processes the semantic association feature representation vectors with the first classification model to obtain a first semantic repetition index between different candidate association words, obtains historical search behavior overlap feature representation vectors between different candidate association words (which represent the degree of search behavior overlap between them), and processes those vectors with the second classification model to obtain a second semantic repetition index between different candidate association words. According to the first and second semantic repetition indexes, the terminal 11 deduplicates and filters the candidate association words with repeated semantics, determines target association words, and displays the target association words and the query keyword at the same time.
Alternatively, the user may input a query keyword through the terminal 11, and the terminal 11 may send the query keyword to the server 12. After obtaining the input query keyword, the server 12 retrieves candidate association words corresponding to the query keyword and returns them to the terminal 11. The terminal 11 then obtains semantic association feature representation vectors between different candidate association words and processes them with the first classification model to obtain a first semantic repetition index between different candidate association words. The terminal 11 also obtains historical search behavior overlap feature representation vectors between different candidate association words, which represent the degree of search behavior overlap between them, and processes those vectors with the second classification model to obtain a second semantic repetition index between different candidate association words. According to the first and second semantic repetition indexes, candidate association words with repeated semantics are deduplicated and filtered to determine target association words, and the target association words and the query keyword are displayed at the same time.
In the above embodiments, the first classification model and the second classification model are trained in advance. They may be stored locally on the terminal 11 or the server 12, or stored on another cloud server; when needed, the terminal 11 or the server 12 obtains the first classification model and/or the second classification model from the cloud server. Alternatively, the terminal 11 or the server 12 transmits the semantic association feature representation vectors and/or the historical search behavior overlap feature representation vectors between different candidate association words to the cloud server; after the cloud server processes these vectors with the first classification model and/or the second classification model, it returns the resulting first semantic repetition index and/or second semantic repetition index to the terminal 11 and/or the server 12. The present disclosure is not limited in this respect.
In the embodiment of the present disclosure, the terminal 11 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like, but is not limited thereto. The server 12 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. The terminal 11 and the server 12 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
It should be understood by those skilled in the art that the above-mentioned terminal 11 and server 12 are only examples, and other existing or future terminals or servers may be suitable for the present application and are included within the scope of the present application and are herein incorporated by reference.
Fig. 2 schematically illustrates a flow chart of an associative word deduplication method according to an embodiment of the present disclosure. As shown in fig. 2, the method provided by the embodiment of the present disclosure may include the following steps.
In step S210, a semantic recall is performed on the query keyword to obtain a candidate associated word of the query keyword.
In the embodiment of the disclosure, the input query keyword is first obtained.
The query keyword may be any field (including at least one of, or any combination of, characters, words, sentences, symbols, and the like) input for finding any one or more of a specific file, website, record, word, or series of records in a database (which may include a blockchain stored in a distributed manner), so that a search engine or the database can perform a corresponding search according to the input keyword. The user may input a query keyword through the terminal, which may then transmit the input query keyword to the server, or not.
For example, the user may input the query keyword in a web address input field of a browser installed on the terminal of the user, where the query keyword may be a URL (Uniform Resource Locator) address, or may be a word such as a company name of XX company.
For another example, the user may input the query keyword in a search box of the application program through various application programs installed on the terminal of the user, so as to initiate a search request and return corresponding information.
For another example, the user may input the query keyword through an input field of an input method installed on his terminal. The present disclosure does not limit the input mode, acquisition source, and the like of the query keyword.
Then, semantic recall is carried out on the query keywords to obtain candidate associated words of the query keywords.
In the embodiment of the present disclosure, the candidate associative word is a richer search semantic or query semantic associated further on the basis of the input query keyword. The present disclosure does not limit the manner in which candidate associative words are recalled according to the query keyword.
For example, if the input query keyword is "music," the candidate association word "music download" and some other candidate association words can be obtained.
In step S220, a semantic association feature representation vector between different candidate association words is obtained, where the semantic association feature representation vector represents a semantic repeating degree between different candidate association words.
In the embodiment of the disclosure, in order to filter out candidate association words with repeated semantics among the candidate association words recalled for the input query keyword, semantic association feature representation vectors between different candidate association words are calculated to express the semantic matching degree between them. The higher the semantic matching degree between two candidate association words, the higher the probability that repeated semantics exist between them; conversely, the lower the matching degree, the lower that probability. The specific manner of calculating the semantic association feature representation vector is described in the embodiments of fig. 3 and 4 below.
In step S230, the semantic association feature expression vectors between different candidate association words are processed by using the first classification model, so as to obtain a first semantic repeating index between different candidate association words.
In the embodiment of the disclosure, any suitable machine learning algorithm can be adopted to construct the first classification model, then the constructed first classification model is trained, and whether semantic repetition exists between any two candidate association words in the candidate association words recalled by the query keyword can be judged by using the trained first classification model.
In an exemplary embodiment, assuming that any two of the candidate association words recalled based on the query keyword are a first association word and a second association word, respectively, a semantic association feature representation vector between the first association word and the second association word may be obtained. Then, processing the semantic association feature representation vectors between different candidate association words by using the first classification model to obtain a first semantic repetition index between different candidate association words may include: inputting the semantic association feature representation vector between the first association word and the second association word into the trained first classification model; obtaining, through the first classification model, a first conditional probability that the classification result takes a predetermined value given the semantic association feature representation vector between the first association word and the second association word; and determining the first semantic repetition index according to the first conditional probability.
The value range of the first conditional probability is a real number greater than or equal to 0 and less than or equal to 1. The first conditional probability is positively correlated with the first semantic repetition index: the larger the first conditional probability, the higher the first semantic repetition index; conversely, the smaller the first conditional probability, the lower the first semantic repetition index. The exact mapping can be set according to the actual situation, which is not limited by the disclosure.
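The patent does not commit to a particular machine learning algorithm for the first classification model. As a hedged illustration only, a logistic-regression-style model maps the semantic association feature representation vector to a conditional probability in [0, 1], which can serve directly as the first semantic repetition index; the weights and bias below are hypothetical stand-ins for trained parameters:

```python
import math

def first_semantic_repetition_index(feature_vector, weights, bias):
    """Sketch of the first classification model as logistic regression.

    Returns the conditional probability that two candidate association
    words are semantic duplicates, used as the first semantic repetition
    index s1. The weights and bias are assumed to come from training.
    """
    z = sum(w * x for w, x in zip(weights, feature_vector)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: probability in [0, 1]
```

The second classification model can be sketched identically with the historical search behavior overlap feature representation vector as input.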
In step S240, a historical search behavior overlap feature representation vector between different candidate association words is obtained, the historical search behavior overlap feature representation vector representing the degree of search behavior overlap between different candidate association words.
In the embodiment of the present disclosure, the historical search behavior overlap feature expression vector between different candidate association words may be obtained according to historical search behaviors of the different candidate association words, where the historical search behavior refers to that, in a search engine, each candidate association word is respectively used as a search keyword to query and recall a corresponding web page link (also referred to as a Uniform Resource Locator, which is abbreviated as URL hereinafter), and any operation behavior data of the URLs by a user may include, for example, an exposed web page link, a web page link clicked by the user in the exposed web page link, an exposure amount of the exposed web page link, a click amount of the clicked web page link, and the like. The higher the degree of overlap of search behavior between two different candidate associative words, the higher the probability of semantic duplication between the two candidate associative words. The manner of obtaining the historical search behavior overlap feature representation vector between different candidate associative words may be specifically referred to as described in the embodiments of fig. 5 and 6 below.
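As an illustration only (the patent does not fix the exact statistics in this passage), overlap features between two candidate association words could be derived from the exposed and clicked web page links of their historical searches, for example the Jaccard overlap of exposed URLs and the share of total clicks that fall on shared URLs; both feature definitions here are assumptions:

```python
def search_behavior_overlap_features(urls_a, urls_b, clicks_a, clicks_b):
    """Hypothetical overlap statistics between the historical search
    results of two candidate association words.

    urls_a / urls_b: sets of URLs exposed when each word was used as a
    search keyword; clicks_a / clicks_b: dicts mapping a URL to its
    click count.
    """
    shared = urls_a & urls_b
    union = urls_a | urls_b
    # fraction of all exposed URLs that both words recalled
    exposure_overlap = len(shared) / len(union) if union else 0.0
    # fraction of all clicks that landed on the shared URLs
    clicks_shared = sum(clicks_a.get(u, 0) + clicks_b.get(u, 0) for u in shared)
    clicks_total = sum(clicks_a.values()) + sum(clicks_b.values())
    click_overlap = clicks_shared / clicks_total if clicks_total else 0.0
    return [exposure_overlap, click_overlap]
```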
In step S250, the historical search behavior overlap feature representation vectors between different candidate association words are processed by using a second classification model, so as to obtain a second semantic repeating index between different candidate association words.
In the embodiment of the disclosure, any suitable machine learning algorithm can be adopted to construct the second classification model, then the constructed second classification model is trained, and the trained second classification model can be used for predicting the second semantic repeating index between any two candidate association words in the candidate association words recalled by the query keywords.
In the embodiment of the present disclosure, the first classification model and the second classification model may use the same machine learning algorithm, or may use different machine learning algorithms.
In an exemplary embodiment, assuming that any two of the candidate association words recalled based on the query keyword are the first association word and the second association word, respectively, a historical search behavior overlap feature representation vector between the first association word and the second association word may be obtained. Then, processing the historical search behavior overlap feature representation vectors between different candidate association words by using the second classification model to obtain a second semantic repetition index between different candidate association words may include: inputting the historical search behavior overlap feature representation vector between the first association word and the second association word into the trained second classification model; obtaining, through the second classification model, a second conditional probability that the classification result takes a predetermined value given the historical search behavior overlap feature representation vector between the first association word and the second association word; and determining the second semantic repetition index according to the second conditional probability.
The value range of the second conditional probability is a real number greater than or equal to 0 and less than or equal to 1. The second conditional probability is positively correlated with the second semantic repetition index: the larger the second conditional probability, the higher the second semantic repetition index; conversely, the smaller the second conditional probability, the lower the second semantic repetition index. The exact mapping can be set according to the actual situation, which is not limited by the disclosure.
In step S260, according to a first semantic repeating index and a second semantic repeating index between different candidate association words, candidate association words with repeated semantics in the candidate association words are deduplicated and filtered, and a target association word is determined, so as to simultaneously display the query keyword and the target association word.
In this embodiment of the present disclosure, according to a first semantic repetition index and a second semantic repetition index between different candidate associated words, deduplicating and filtering candidate associated words with semantic repetition in the candidate associated words, and determining a target associated word may include: if a first semantic repetition index between the first associated word and the second associated word is greater than a first threshold value or the second semantic repetition index is greater than a second threshold value, determining that semantic repetition exists between the first associated word and the second associated word, and selecting the first associated word or the second associated word as the target associated word; or determining a target semantic repeating index between the first associated word and the second associated word according to a first semantic repeating index and a second semantic repeating index between the first associated word and the second associated word; if the target semantic repetition index between the first associated word and the second associated word is larger than a target threshold value, judging that semantic repetition exists between the first associated word and the second associated word, and selecting the first associated word or the second associated word as the target associated word.
For example, assume that the semantic association feature representation vector between the first association word and the second association word is processed through the trained first classification model, and a first semantic repetition index s1 between the first association word and the second association word is obtained through prediction; the historical search behavior overlap feature representation vector between the first association word and the second association word is processed through the trained second classification model, and a second semantic repetition index s2 between the first association word and the second association word is obtained through prediction.
Whether there is semantic repetition between the first association word and the second association word may then be determined in either of two ways:

Mode 1 takes the union: when s1 is greater than a first threshold (e.g., 0.8, for illustration only) or s2 is greater than a second threshold (e.g., 0.9, for illustration only), the first association word and the second association word are considered semantic duplicates. In other words, the prediction results of the first classification model and the second classification model are merged.
Mode 2 performs a weighted summation of the first semantic repetition index and the second semantic repetition index to determine a final target semantic repetition index s:

s = w · s1 + (1 − w) · s2

where w is the weight coefficient of the first semantic repetition index, representing the contribution of the first semantic repetition index to the target semantic repetition index; its value range is a real number greater than 0 and less than 1.
Whether semantic duplication exists between the first associated word and the second associated word is judged according to the target semantic duplication index obtained by the weighted summation, for example, if the target semantic duplication index s is greater than a target threshold (for example, 0.9, which is used for illustration only), the first associated word and the second associated word are considered to have semantic duplication.
Any one of the two modes can be selected according to the actual application scene.
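The two modes can be sketched as follows; the thresholds are the illustrative values from the text, and the weight alpha for mode 2 is an assumed value:

```python
def is_semantic_duplicate(s1, s2, mode=1,
                          t1=0.8, t2=0.9, alpha=0.5, t_target=0.9):
    """Decide whether two candidate association words are semantic
    duplicates from the two repetition indexes s1 and s2.

    Mode 1: union of the two model outputs against separate thresholds.
    Mode 2: weighted sum s = alpha*s1 + (1-alpha)*s2 vs. a target threshold.
    """
    if mode == 1:
        return s1 > t1 or s2 > t2
    s = alpha * s1 + (1 - alpha) * s2
    return s > t_target
```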
For example, the target association words may be displayed on the terminal into which the query keyword was input. The user may select from the displayed target association words, may directly initiate a search request with the query keyword itself, or, in an input-method scenario, may use a target association word as text input into a document.
On the one hand, the association word deduplication method provided by the embodiment of the disclosure calculates semantic association feature representation vectors between different candidate association words to obtain the semantic matching degree between them, and thereby determines a first semantic repetition index between different candidate association words. On the other hand, it calculates historical search behavior overlap feature representation vectors between different candidate association words according to their historical search behaviors, obtaining a second semantic repetition index between different candidate association words. The method then combines the first and second semantic repetition indexes to judge whether semantic repetition exists between different candidate association words, and deduplicates and filters the candidate association words with repeated semantics. By comprehensively considering both the semantic repetition degree and the historical search behavior overlap of different candidate association words, it effectively judges whether semantic repetition exists between them. When applied to a real search scenario or to an application scenario such as an input method, the method deduplicates the candidate association words, displays the target association words that remain free of semantic repetition after filtering, and recommends more diverse target association words to the user, helping the user find the desired information more quickly.
Fig. 3 schematically shows a flowchart of step S220 in fig. 2 in an exemplary embodiment. As shown in fig. 3, the difference from the embodiment of fig. 2 is that, taking any two candidate associations, i.e. the first association and the second association, as an example, the step S220 may further include steps S221 to S224.
In step S221, distance information between the first and second associative words is obtained.
In the embodiment of the disclosure, the distance information between the first associated word and the second associated word represents the semantic similarity between the first associated word and the second associated word, and the smaller the distance information between the first associated word and the second associated word, the larger the semantic similarity between the first associated word and the second associated word; conversely, the smaller the semantic similarity between the two.
In an exemplary embodiment, the distance information between the first and second associative words may include a semantic distance between the first and second associative words. Obtaining distance information between the first associated word and the second associated word may include: obtaining semantic similarity between the first associated word and the second associated word; obtaining a common prefix length of a character string between the first associated word and the second associated word; determining similarity weight of the common prefix length of the character string; obtaining editing similarity between the first association word and the second association word according to semantic similarity between the first association word and the second association word, the common prefix length of the character string and the similarity weight of the common prefix length of the character string; and obtaining the semantic distance between the first associated word and the second associated word according to the editing similarity between the first associated word and the second associated word.
In an exemplary embodiment, obtaining the semantic similarity between the first associative word and the second associative word may include: obtaining a first character string length of the first associated word and a second character string length of the second associated word; obtaining the number of matched characters between the first associated word and the second associated word; obtaining the number of character conversion times between the first associated word and the second associated word; and obtaining the semantic similarity between the first associative word and the second associative word according to the first character string length, the second character string length, the number of the matched characters and the number of character conversion times. The calculation of semantic distance and semantic similarity may be specifically referred to as described in the embodiment of fig. 4 below.
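The quantities listed above (the two string lengths, the number of matched characters, and the number of character conversions, i.e. transpositions) are the ingredients of the Jaro similarity; the patent does not name the metric explicitly, so this sketch assumes the standard Jaro formula:

```python
def jaro_similarity(s1, s2):
    """Jaro similarity from string lengths, matched characters,
    and transpositions (the 'character conversion times')."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)  # match search window
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):  # greedily pair up matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0  # matched chars out of order, halved
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3
```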
In an exemplary embodiment, the distance information between the first association word and the second association word includes an edit distance between the first association word and the second association word. Obtaining the distance information may include: obtaining the minimum number of editing operations required to convert one of the first association word and the second association word into the other; and obtaining the edit distance between the first association word and the second association word according to the minimum number of editing operations. The calculation of the edit distance is specifically described in the embodiment of fig. 4 below.
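The minimum-edit-operation count described above is the classic Levenshtein distance; a compact dynamic-programming sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of single-character
    insert / delete / substitute operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete from s1
                           cur[j - 1] + 1,               # insert into s1
                           prev[j - 1] + (c1 != c2)))    # substitute
        prev = cur
    return prev[-1]
```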
In an exemplary embodiment, the distance information between the first association word and the second association word may include a similar distance between the first association word and the second association word. Obtaining the distance information may include: obtaining a first word set of the first association word and a second word set of the second association word, where the first word set comprises the non-repeated characters of the first association word and the second word set comprises the non-repeated characters of the second association word; obtaining the number of intersection elements between the first word set and the second word set; obtaining the number of union elements between the first word set and the second word set; obtaining a similarity coefficient between the first association word and the second association word according to the number of intersection elements and the number of union elements; and obtaining the similar distance between the first association word and the second association word according to the similarity coefficient. The calculation of the similar distance is specifically described in the embodiment of fig. 4 below. In the embodiment of fig. 4, the distance information includes the semantic distance, the edit distance, and the similar distance at the same time, but the disclosure is not limited thereto.
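The similarity coefficient built from intersection and union element counts over deduplicated character sets matches the Jaccard index; assuming, as one plausible reading, that the similar distance is one minus this coefficient:

```python
def jaccard_similar_distance(word1, word2):
    """Similarity coefficient as the Jaccard index over the deduplicated
    character sets of the two words; the similar distance is taken here
    (an assumption) as 1 - coefficient."""
    set1, set2 = set(word1), set(word2)  # non-repeated characters
    union = set1 | set2
    coeff = len(set1 & set2) / len(union) if union else 1.0
    return 1.0 - coeff
```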
In the embodiment of the present disclosure, the word set (including the first word set and the second word set) is obtained by performing word-level word segmentation processing on corresponding candidate associated words (such as the first associated word and the second associated word), and the words in the word set do not take into consideration the order, and meanwhile, the repeated words are removed from the words in the word set.
For example, assuming that the first association word is "detection result after nucleic acid detection" and the second association word is "nucleic acid detection cost", the first word set is { 'nucleic', 'acid', 'detection', 'post', 'of', 'combination', 'result' } and the second word set is { 'nucleic', 'acid', 'detection', 'cost', 'use' }.
In step S222, common character information between the first and second associative words is obtained.
In an exemplary embodiment, the common character information between the first association word and the second association word may include at least one of the longest common prefix length of the character strings, the common word proportion, the word set union, the word set intersection, and the like between the first association word and the second association word. In the embodiment of fig. 4 below, the common character information between the first association word and the second association word includes the longest common prefix length of the character strings, the common word proportion, the word set union, and the word set intersection between the first association word and the second association word.
In an exemplary embodiment, when the common character information between the first and second associated words includes a common word proportion between the first and second associated words, obtaining the common character information between the first and second associated words may include: obtaining a first word sequence of the first associative word and a second word sequence of the second associative word; obtaining a first character public word according to the number of characters in the first character sequence, which belong to the second character sequence; obtaining a second character public word according to the number of characters in the second character sequence, which belong to the first character sequence; obtaining the length of the common word according to the length of the first character common word and the length of the second character common word; obtaining the length of the word sequence according to the length of the first word sequence and the length of the second word sequence; and obtaining the public word proportion between the first associative word and the second associative word according to the public word length and the word sequence length.
In the embodiment of the present disclosure, the word sequence (including the first word sequence and the second word sequence) is obtained by performing word-level word segmentation processing on corresponding candidate associated words (such as the first associated word and the second associated word), and the words in the word sequence consider the order of the corresponding words in the candidate associated words, and meanwhile, the repeated words are not removed by the words in the word sequence.
For example, assuming that the first association word is "detection result after nucleic acid detection" and the second association word is "nucleic acid detection cost", the first word sequence is { 'nucleus', 'acid', 'detection', 'assay', 'post', 'detection', 'assay', 'knot', 'fruit' } and the second word sequence is { 'nucleus', 'acid', 'detection', 'assay', 'cost', 'use' }.
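Following the steps above, the common word proportion can be sketched as counting, in each word-level sequence (order kept, duplicates kept), the characters that also occur in the other sequence, and normalizing the summed common-word length by the combined sequence length:

```python
def common_word_proportion(word1, word2):
    """Common word proportion between two candidate association words:
    characters of each word sequence that also appear in the other,
    summed over both directions, divided by the total sequence length."""
    seq1, seq2 = list(word1), list(word2)  # word sequences keep duplicates
    common1 = [c for c in seq1 if c in seq2]  # first common word
    common2 = [c for c in seq2 if c in seq1]  # second common word
    common_len = len(common1) + len(common2)
    total_len = len(seq1) + len(seq2)
    return common_len / total_len if total_len else 0.0
```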
In step S223, character string length information between the first and second associative words is obtained.
In an exemplary embodiment, the string length information between the first and second associated words may include at least one of a word set length difference between a first word set of the first associated word and a second word set of the second associated word, a word set length ratio between the first word set and the second word set, a string length difference between the first and second associated words, a string length ratio between the first and second associated words, and the like. In the embodiment of fig. 4, the information on the length of the character string between the first associated word and the second associated word includes a difference in length of a word set between a first word set of the first associated word and a second word set of the second associated word, a ratio of length of the word set between the first word set and the second word set, a difference in length of the character string between the first associated word and the second associated word, and a ratio of length of the character string between the first associated word and the second associated word, etc., which are taken as examples, but the disclosure is not limited thereto.
In step S224, a semantic association feature expression vector between the first associative word and the second associative word is generated according to the distance information, the common character information, and the character string length information between them.
In the embodiment of the present disclosure, the distance information, common character information, and character string length information between the first and second associative words may be spliced (concatenated) to serve as the semantic association feature expression vector between them. However, the present disclosure is not limited to this; for example, different weight coefficients may be assigned to the distance information, the common character information, and the character string length information, each multiplied by its corresponding weight coefficient before concatenation to generate the semantic association feature expression vector.
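A minimal sketch of the splicing step in plain Python; the grouping into three feature lists and the default unit weights are illustrative assumptions, not the patent's exact layout:

```python
def build_feature_vector(distance_feats, common_char_feats, length_feats,
                         weights=(1.0, 1.0, 1.0)):
    """Splice the three feature groups into one expression vector,
    optionally multiplying each group by a weight coefficient first."""
    vec = []
    for feats, w in zip((distance_feats, common_char_feats, length_feats), weights):
        vec.extend(w * f for f in feats)
    return vec

# Unweighted splicing (the default) reduces to plain concatenation:
v = build_feature_vector([0.2, 0.5, 0.3], [4, 0.6, 9, 5], [1, 1.2, 2, 1.1])
```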
The associative-word deduplication method provided by the embodiment of the disclosure converts the deduplication problem into the problem of judging whether semantic matching exists between different candidate associative words. By obtaining the distance information, common character information, and character string length information between different candidate associative words, it represents the semantic association feature expression vector between them and obtains their semantic matching degree, and can thus effectively judge whether semantic duplication exists between different candidate associative words. The method can be applied to real search scenarios or to applications such as input methods: after deduplication filtering, only target associative words without semantic duplication are displayed, recommending diverse target associative words to the user and helping the user find information more quickly.
Fig. 4 schematically shows how the first semantic repetition index is obtained according to an embodiment of the present disclosure.
The model utilized by the associative-word deduplication method shown in fig. 4 may include a first feature extraction layer and a first prediction layer, which are described separately below.
In the embodiment of fig. 4, the input to the first feature extraction layer consists of the text character strings of two candidate associative words, referred to respectively as associative word 1 (denoted q1, i.e., the first associative word) and associative word 2 (denoted q2, i.e., the second associative word). The first word set corresponding to associative word 1 (q1) is denoted S1, and the second word set corresponding to associative word 2 (q2) is denoted S2; that is, S1 and S2 contain the non-repeating words of q1 and q2, respectively. Likewise, the first word sequence corresponding to associative word 1 (q1) is denoted C1, and the second word sequence corresponding to associative word 2 (q2) is denoted C2.
The feature extraction layer extracts the following features: F1: semantic distance; F2: edit distance; F3: similar distance; F4: longest common prefix length (i.e., longest string common prefix length); F5: common word ratio; F6: word set union size; F7: word set intersection size; F8: word set length difference; F9: word set length ratio; F10: string length difference; F11: string length ratio. For associative word 1 and associative word 2, the 11 features are calculated and spliced into an 11-dimensional semantic association feature expression vector V, which serves as the input to the first prediction layer. Formally:
V = concat(F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11)    (2)
In the above formula (2), concat denotes concatenation (splicing).
The features are described below.
Feature F1: semantic distance:
First, the calculation formula for the semantic similarity between the first associative word and the second associative word is given:
sim_j = (1/3) · ( m/|q1| + m/|q2| + (m − t)/m )    (3)
In the above formula (3), |q1| denotes the length of the first character string corresponding to associative word 1, and |q2| denotes the length of the second character string corresponding to associative word 2. m is the number of matched characters between the first character string corresponding to associative word 1 and the second character string corresponding to associative word 2. t is the number of character transpositions between the first character string corresponding to associative word 1 and the second character string corresponding to associative word 2.
Specifically, define the matching window MW = ⌊max(|q1|, |q2|) / 2⌋ − 1. The comparison between the characters of associative word 1 and associative word 2 is limited to the matching window: if two characters of associative word 1 and associative word 2 are equal within the range of the matching window, the match succeeds; if the range of the matching window is exceeded, the match fails. That is, two characters of q1 and q2 are considered matched only when they are identical and the distance between their positions does not exceed MW. Even if a character in associative word 2 equals some character in associative word 1, when the two positions are too far apart, the correlation between the characters is too low and they are not considered matched. The matched characters of q1 and q2 are then compared in order, and the number of matched characters that appear in a different order, divided by 2, gives the transposition count t.
For example, assume the first character string A = "bacde" corresponding to associative word 1 is matched against the second character string B = "abed" corresponding to associative word 2, with a matching window size of 1. The characters 'a', 'b', and 'd' are matched; for 'd', indexInA('d') = 3 (the subscript of 'd' in A is 3) and indexInB('d') = 3 (the subscript of 'd' in B is also 3), so the distance between the two positions is 0, which is within the matching window. For the character 'e', although both strings contain 'e', its subscripts in the first and second character strings are 4 and 2, respectively, giving a distance of 2 > 1 (the matching window), so 'e' is not matched. Since 3 characters are matched, m = 3. Also in this example, 'a' and 'b' are both matched, but they appear in the two strings as "ba…" and "ab…", i.e., in different orders, so t = 1.
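Under the definitions above (matching window, matched count m, transposition count t), the semantic similarity of formula (3) can be sketched in Python; this follows the standard Jaro similarity computation, which the description matches:

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Semantic similarity of formula (3): (m/|s1| + m/|s2| + (m-t)/m) / 3,
    with matching window MW = max(|s1|, |s2|) // 2 - 1."""
    if s1 == s2:
        return 1.0
    mw = max(len(s1), len(s2)) // 2 - 1
    match1 = [False] * len(s1)
    match2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - mw), min(len(s2), i + mw + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t: matched characters appearing in a different order, divided by 2
    k, t = 0, 0
    for i in range(len(s1)):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3
```

Applied to the "bacde" / "abed" example, the function finds m = 3 and t = 1, as in the worked example above.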
After the semantic similarity sim_j is calculated as above, the edit similarity sim_w between associative word 1 and associative word 2 can be calculated according to the following formula:
sim_w = sim_j + l · p · (1 − sim_j)    (4)
In the above formula (4), l denotes the number of common prefix characters of the first and second character strings, i.e., the string common prefix length. For example, if q1 is "apple" and q2 is "apple cell phone", then l is 2 (the two common prefix characters 苹果 in the original Chinese). l is capped at 4. p is a constant scaling factor describing the contribution of the common prefix to the similarity, and is therefore referred to here as the similarity weight of the string common prefix length. The larger p is, the larger the weight of the common prefix, and p can be adjusted upward to favor the common prefix; however, p must not exceed 0.25, otherwise the edit similarity could exceed 1. The constant p defaults to 0.1.
In the embodiment of the present disclosure, a prefix refers to any leading combination of characters of a string other than the whole string. Taking "ABCDABD" as an example, its prefixes are [A, AB, ABC, ABCD, ABCDA, ABCDAB].
The semantic distance between associative word 1 and associative word 2 can then be calculated according to the following formula:
F1 = 1 − sim_w    (5)
feature F2: the edit distance is the minimum number of edit operations required for converting one character string corresponding to the association word 1 and the second character string corresponding to the association word 2 into the other character string, and can be used for measuring the similarity between the two character strings to obtain the similarity between the two character strings. The operations may include replacing one character with another, inserting one character, and deleting one character.
For example, converting the first string "kitten" into the second string "sitting":
sitten (k is replaced by s)
sittin (e is replaced by i)
sitting (insert g)
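The feature F2 computation is the standard Levenshtein edit distance; a row-by-row dynamic-programming sketch:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Feature F2: minimum number of single-character insertions, deletions,
    and substitutions turning s1 into s2 (dynamic programming)."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                  # delete c1
                           cur[j - 1] + 1,               # insert c2
                           prev[j - 1] + (c1 != c2)))    # substitute (free if equal)
        prev = cur
    return prev[-1]
```

For the example above, the three operations (k→s, e→i, insert g) give edit_distance("kitten", "sitting") = 3.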
Feature F3: the similar distance is measured by the proportion of the elements that differ between the first word set corresponding to associative word 1 and the second word set corresponding to associative word 2, among all the elements.
First, the similarity coefficient between associative word 1 and associative word 2 can be calculated by the following formula:
J(q1, q2) = |S1 ∩ S2| / |S1 ∪ S2|    (6)
The similarity coefficient is the ratio of the number of elements in the intersection of the first word set S1 and the second word set S2, |S1 ∩ S2|, to the number of elements in their union, |S1 ∪ S2|.
The formula for calculating the similar distance can then be expressed as:
F3 = 1 − J(q1, q2)    (7)
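Formulas (6) and (7) amount to one minus the Jaccard coefficient of the two word sets; a sketch:

```python
def similar_distance(q1: str, q2: str) -> float:
    """Feature F3: 1 - |S1 ∩ S2| / |S1 ∪ S2|, where S1 and S2 are the
    word sets (distinct characters) of the two associative words."""
    s1, s2 = set(q1), set(q2)
    return 1.0 - len(s1 & s2) / len(s1 | s2)
```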
feature F4: the longest common prefix length defined as the longest common prefix length of the first character string corresponding to the association word 1 and the second character string corresponding to the association word 2, that is, the longest common prefix length of the character strings, can be used to measure the difference between the two association words to some extent, and the longer the common prefix length of the longest common character string is, the higher the repetition degree of the two association words is, the higher the possibility of containing the same information is. Feature F4 may be expressed as the following equation:
Figure 269802DEST_PATH_IMAGE025
in the above-mentioned formula (8),
Figure 578423DEST_PATH_IMAGE026
and
Figure 714876DEST_PATH_IMAGE027
and the ith same prefix character in the first word sequence and the second word sequence is represented, wherein i is a positive integer greater than or equal to 1.
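Feature F4 reduces to counting the leading characters on which the two strings coincide:

```python
def longest_common_prefix_len(q1: str, q2: str) -> int:
    """Feature F4: length of the longest common prefix of the two strings."""
    n = 0
    for a, b in zip(q1, q2):
        if a != b:
            break
        n += 1
    return n
```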
Feature F5: the common word ratio is the proportion of common words in associative word 1 and associative word 2.
First, the number of characters in the first word sequence that also belong to the second word sequence, i.e., the first common word count n1, can be calculated according to the following formula:
n1 = Σ_{c ∈ C1} 1[c ∈ C2]    (9)
The number of characters in the second word sequence that belong to the first word sequence, i.e., the second common word count n2, can be calculated according to the following formula:
n2 = Σ_{c ∈ C2} 1[c ∈ C1]    (10)
For example, suppose every one of the 3 characters of the first word sequence C1 appears in the second word sequence C2, so n1 = 3, while only 2 of the characters of C2 appear in C1, so n2 = 2.
Then, the common word ratio F5 is obtained by the following formula:
F5 = (n1 + n2) / (|C1| + |C2|)    (11)
In the above formula (11), n1 denotes the first common word count and n2 the second common word count; their sum gives the common word count. |C1| denotes the length of the first word sequence and |C2| the length of the second word sequence; their sum gives the word sequence length.
Feature F6: the word set union size represents the number of elements in the union of the first word set S1 of associative word 1 and the second word set S2 of associative word 2, which can be expressed by the following formula:
F6 = |S1 ∪ S2|    (12)
feature F7: the intersection of the word sets represents the first word set of associative words 1
Figure 634288DEST_PATH_IMAGE005
And a second set of associative words 2
Figure 251215DEST_PATH_IMAGE006
The number of intersection elements of (a) can be expressed by the following formula:
Figure 26535DEST_PATH_IMAGE041
feature F8: word set length difference represents first word set of associative word 1
Figure 375608DEST_PATH_IMAGE005
And a second set of associative words 2
Figure 531651DEST_PATH_IMAGE006
The number of elements in the difference set can be expressed by the following formula:
Figure 624372DEST_PATH_IMAGE042
feature F9: first word set with word set length ratio representing associative word 1
Figure 519778DEST_PATH_IMAGE005
And a second set of associated words 2
Figure 898807DEST_PATH_IMAGE006
The ratio of the number of elements can be expressed by the following formula:
Figure 683092DEST_PATH_IMAGE043
feature F10: the character string length difference represents an absolute value of a length difference between a first character string corresponding to the association word 1 and a second character string corresponding to the association word 2, and can be represented by the following formula:
Figure 438559DEST_PATH_IMAGE044
feature F11: the character string length ratio represents the length ratio between the first character string corresponding to the association word 1 and the second character string corresponding to the association word 2, and can be expressed by the following formula:
Figure 437739DEST_PATH_IMAGE045
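Features F6 through F11 are simple set and length statistics; a combined sketch:

```python
def set_and_length_features(q1: str, q2: str) -> dict:
    """Features F6-F11: word-set union/intersection sizes, word-set length
    difference and ratio, string length difference and ratio."""
    s1, s2 = set(q1), set(q2)
    return {
        "F6": len(s1 | s2),                # word set union size
        "F7": len(s1 & s2),                # word set intersection size
        "F8": abs(len(s1) - len(s2)),      # word set length difference
        "F9": len(s1) / len(s2),           # word set length ratio
        "F10": abs(len(q1) - len(q2)),     # string length difference
        "F11": len(q1) / len(q2),          # string length ratio
    }
```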
the first prediction layer in the embodiment of fig. 4 includes a trained first classification model, and the prediction outputs a first semantic repeating index. Assuming that the first classification model is a two-class model, taking a log-linear model as an example, the first semantic repeating index of two associated words can be predicted by the first classification model. The first classification model is simple and easy to use, and is low in calculation cost and low in service deployment cost while the accuracy of the prediction effect is guaranteed. However, the present disclosure is not limited thereto, and other two-class models may be used.
In the embodiment of fig. 4, it is assumed that the first conditional probability distribution is predicted as:
P(y = 1 | V) = exp(W1 · V + b1) / (1 + exp(W1 · V + b1))    (18)
P(y = 0 | V) = 1 / (1 + exp(W1 · V + b1))    (19)
In the above formulas (18) and (19), the values of W1 and b1 are obtained by training on labeled samples, and P(y = 1 | V) is the first conditional probability that semantic repetition exists between associative word 1 and associative word 2. The first semantic repetition index may be determined based on the first conditional probability; in some embodiments, the first semantic repetition index may be set equal to the first conditional probability. For example, if P(y = 1 | V) is calculated to equal 0.5, then the first semantic repetition index of associative word 1 and associative word 2 equals 0.5.
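A minimal sketch of the first prediction layer as a binary log-linear (logistic) model; the weight vector and bias passed in are illustrative placeholders standing in for the trained W1 and b1:

```python
import math

def predict_repeat_probability(v, w1, b1):
    """First conditional probability P(y=1 | V) of semantic repetition,
    per formula (18): a logistic function of the linear score W1·V + b1."""
    z = sum(wi * vi for wi, vi in zip(w1, v)) + b1
    return 1.0 / (1.0 + math.exp(-z))
```

A zero feature vector with zero bias yields a probability of exactly 0.5, matching the worked example above.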
Fig. 5 schematically shows a flowchart of step S240 in fig. 2 in an exemplary embodiment. Again taking the example in which the candidate associative words include the first associative word and the second associative word, the difference from the above embodiment is that step S240 may further include the following steps.
In step S241, with the first associative word as a search keyword, the first history clicked web page links and their first history click volumes, and the first history exposed web page links and their first history exposure volumes within a predetermined time period, are obtained.
In the embodiment of the disclosure, in a predetermined time period (which may be set according to actual conditions, for example, the last month), the user takes the first associated word as a search keyword (query), the search engine recalls the corresponding URL after receiving the search keyword input by the user, the user clicks one or more URLs therein, the clicked one or more URLs are referred to as first history clicked web page links, and the number of clicks of each of the clicked one or more URLs in the predetermined time period is referred to as a first history click amount.
In the embodiment of the disclosure, in a predetermined time period, the user takes the first associated word as a search keyword (query), the search engine recalls the corresponding URL after receiving the search keyword input by the user, and exposes one or more URLs therein to the user after various processing (for example, sorting, etc.), the exposed one or more URLs are referred to as a first historical exposure webpage link, and the number of exposures of each URL of the exposed one or more URLs in the predetermined time period is referred to as a first historical exposure.
In step S242, with the second associative word as a search keyword, the second history clicked web page links and their second history click volumes, and the second history exposed web page links and their second history exposure volumes within the predetermined time period, are obtained.
In the embodiment of the disclosure, in a predetermined time period, the user takes the second associated word as a search keyword (query), the search engine recalls corresponding URLs after receiving the search keyword input by the user, the user clicks one or more URLs therein, the clicked one or more URLs are referred to as second history clicked web page links, and the number of clicks of each of the clicked one or more URLs in the predetermined time period is referred to as a second history click amount.
In the embodiment of the disclosure, in a predetermined time period, the user takes the second associated word as a search keyword (query), the search engine recalls the corresponding URL after receiving the search keyword input by the user, and exposes one or more URLs therein to the user after various processing (for example, sorting, etc.), the exposed one or more URLs are referred to as a second history exposure web page link, and the number of exposures of each URL of the exposed one or more URLs in the predetermined time period is referred to as a second history exposure.
In step S243, according to the first history clicked web page links and their first history click volumes, the first history exposed web page links and their first history exposure volumes, the second history clicked web page links and their second history click volumes, and the second history exposed web page links and their second history exposure volumes, the following are obtained: the clicked web page link overlap degree and the web page link click overlap degree between the first and second history clicked web page links, and the exposed web page link overlap degree and the web page link exposure overlap degree between the first and second history exposed web page links.
In the embodiment of the disclosure, the overlap degree of the clicked web page links represents the number of web page links which are the same or overlapped in the first history clicked web page links and the second history clicked web page links. The webpage link click overlapping degree represents the weighted sum of the same or overlapped webpage links and the click rate thereof in the first historical clicked webpage link and the second historical clicked webpage link, wherein the click rate of the same or overlapped webpage links in the first historical clicked webpage link and the second historical clicked webpage link can be equal to the sum of the first historical click rate and the second historical click rate of the same or overlapped webpage links.
In the embodiment of the disclosure, the exposure webpage link overlapping degree represents the number of the same or overlapping webpage links in the first history exposure webpage link and the second history exposure webpage link. The webpage link exposure overlapping degree represents a weighted sum of the same or overlapped webpage links and the exposure amount thereof in the first history exposure webpage link and the second history exposure webpage link, wherein the exposure amount of the same or overlapped webpage links in the first history exposure webpage link and the second history exposure webpage link can be equal to the sum of the first history exposure amount and the second history exposure amount of the same or overlapped webpage links.
In step S244, a history search behavior overlap feature representation vector between different candidate association words is generated according to the clicked web page link overlap and web page link click overlap between the first history clicked web page link and the second history clicked web page link, and the exposed web page link overlap and web page link exposure overlap between the first history exposed web page link and the second history exposed web page link.
The way in which the second semantic repetition index is obtained is illustrated by the embodiment of fig. 6.
The associative-word deduplication method shown in fig. 6 is implemented based on the click log table and the exposure log table of a search engine, and the model utilized may include a second feature extraction layer and a second prediction layer. The historical search behavior features of associative word 1 and associative word 2 are obtained first, and then a second classification model is constructed, as introduced below.
In the embodiment of fig. 6, the input to the second feature extraction layer consists of the text character strings of two candidate associative words, referred to respectively as associative word 1 (denoted q1, i.e., the first associative word) and associative word 2 (denoted q2, i.e., the second associative word). For the two associative words, it is assumed here that 4 historical search behavior features are obtained: f1 denotes the clicked web page link overlap degree between the first history clicked web page link and the second history clicked web page link; f2 denotes the exposed web page link overlap degree between the first history exposed web page link and the second history exposed web page link; f3 denotes the web page link click overlap degree between the first and second history clicked web page links; and f4 denotes the web page link exposure overlap degree between the first and second history exposed web page links. According to these 4 historical search behavior features, a 4-dimensional historical search behavior overlap feature representation vector Vf is obtained as the input to the second prediction layer, formally:
Vf = concat(f1, f2, f3, f4)    (20)
these four features are described below.
Feature f1: the top100 clicked-URL overlap degree is taken here as an example of the clicked web page link overlap degree between the first history clicked web page link and the second history clicked web page link (top100 is merely illustrative, not limiting, and may be chosen according to actual circumstances).
Assume a click log table corresponding to search keywords (here, associative word 1 and associative word 2) over a whole month is selected. In the click log table, the first history clicked web page links (URLs) clicked by users for associative word 1 are counted: for example, assume users clicked URL1_1 (with corresponding first history click volume uv1_1), …, URL1_n (with uv1_n), where n is the number of first history clicked web page links and a positive integer greater than or equal to 1. Similarly, for associative word 2, assume users clicked URL2_1 (with second history click volume uv2_1), …, URL2_m (with uv2_m), where m is the number of second history clicked web page links and a positive integer greater than or equal to 1. In the embodiment of the disclosure, if the first and second history clicked web page links are highly overlapped, it indicates to a great extent that a semantic repetition relationship exists between associative word 1 and associative word 2.
For example, assume the first history clicked web page link list corresponding to associative word 1 is U_uv1 and the second history clicked web page link list corresponding to associative word 2 is U_uv2. Sort U_uv1 and U_uv2 in descending order by first and second history click volume, respectively; take the top 100 first history clicked web page links as UT_uv1 and the top 100 second history clicked web page links as UT_uv2. The top100 clicked-URL overlap degree is defined as the number of overlapping URLs in UT_uv1 and UT_uv2, calculated as:
f1 = | UT_uv1 ∩ UT_uv2 |    (21)
characteristic f 2: the top100 exposure URL overlapping degree represents the exposure web page link overlapping degree between the first history exposure web page link and the second history exposure web page link by way of example.
Similar to feature f1, assume an exposure log table corresponding to the search keywords over a whole month is selected. In the exposure log table, the first history exposed web page links (URLs) exposed to users for associative word 1 are counted: for example, assume URL1_1 (with corresponding first history exposure volume pv1_1), …, URL1_n (with pv1_n) were exposed, where n is the number of first history exposed web page links and a positive integer greater than or equal to 1. Similarly, for associative word 2, assume URL2_1 (with second history exposure volume pv2_1), …, URL2_m (with pv2_m) were exposed, where m is the number of second history exposed web page links and a positive integer greater than or equal to 1. In the embodiment of the disclosure, if the first and second history exposed web page links are highly overlapped, it indicates to a great extent that a semantic repetition relationship exists between associative word 1 and associative word 2.
For example, assume the first history exposed web page link list corresponding to associative word 1 is U_pv1 and the second history exposed web page link list corresponding to associative word 2 is U_pv2. Sort U_pv1 and U_pv2 in descending order by first and second history exposure volume, respectively; take the top 100 first history exposed web page links as UT_pv1 and the top 100 second history exposed web page links as UT_pv2. The top100 exposed-URL overlap degree is defined as the number of overlapping URLs in UT_pv1 and UT_pv2, calculated as:
f2 = | UT_pv1 ∩ UT_pv2 |    (22)
characteristic f 3: the top100 click URL click overlap degree represents the webpage link click overlap degree between the first history click webpage link and the second history click webpage link by way of example.
On the basis of feature f1, with the top 100 lists UT_uv1 and UT_uv2 obtained as above, the top100 clicked-URL click overlap degree is defined as the weighted sum of the overlapping URLs in UT_uv1 and UT_uv2 by their corresponding click volumes, calculated as:
f3 = Σ_{i ∈ UT_uv1 ∩ UT_uv2} UV_i    (23)
In the above formula, UV_i is the click volume of overlapping URL i in UT_uv1 and UT_uv2.
Characteristic f4: the top100 exposure URL exposure overlap degree, which represents the web page link exposure overlap degree between the first historical exposure web page links and the second historical exposure web page links.
On the basis of the feature f2, assume the first historical exposure web page link list corresponding to the association word 1 is Upv1, and the second historical exposure web page link list corresponding to the association word 2 is Upv2. Sort Upv1 and Upv2 in descending order of the first historical exposure amount and the second historical exposure amount respectively, take the top 100 first historical exposure web page links as UTpv1, and take the top 100 second historical exposure web page links as UTpv2. The top100 exposure URL exposure overlap degree is defined as the weighted sum of the exposure amounts of the overlapping URLs in UTpv1 and UTpv2, calculated by the formula:

f4 = Σ_{j ∈ UTpv1 ∩ UTpv2} PVj
where j ∈ UTpv1 ∩ UTpv2, and PVj is the exposure amount of the overlapping URL j in UTpv1 and UTpv2.
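As a concrete illustration, the four historical search behavior features can be computed from per-URL click and exposure counts. The sketch below is a minimal Python rendering under stated assumptions: the log tables are already aggregated into dicts mapping URL to amount, f1 is taken as the click analogue of f2 (an overlap count), and each overlapping URL's weight in f3/f4 is taken as the sum of the two association words' amounts for it — the text does not fully specify how UVi/PVj aggregate the two words' amounts.

```python
def top_k_links(url_amounts, k=100):
    """Sort URLs by their amount in descending order and keep the top-k
    URLs as a set (the UT lists in the text)."""
    ranked = sorted(url_amounts, key=url_amounts.get, reverse=True)
    return set(ranked[:k])


def overlap_features(uv1, uv2, pv1, pv2, k=100):
    """Compute the four historical search behavior features for a pair of
    association words. uv1/uv2 map URL -> click amount for word 1 / word 2;
    pv1/pv2 map URL -> exposure amount."""
    t_uv1, t_uv2 = top_k_links(uv1, k), top_k_links(uv2, k)
    t_pv1, t_pv2 = top_k_links(pv1, k), top_k_links(pv2, k)
    click_overlap = t_uv1 & t_uv2     # UTuv1 ∩ UTuv2
    expo_overlap = t_pv1 & t_pv2      # UTpv1 ∩ UTpv2
    f1 = len(click_overlap)           # top-k click URL overlap count (assumed form of f1)
    f2 = len(expo_overlap)            # top-k exposure URL overlap count (formula above)
    # f3/f4: weighted sums over overlapping URLs; summing both words' amounts
    # per URL is an assumption, since UVi/PVj are not fully specified.
    f3 = sum(uv1[u] + uv2[u] for u in click_overlap)
    f4 = sum(pv1[u] + pv2[u] for u in expo_overlap)
    return f1, f2, f3, f4
```

In practice k = 100 matches the "top100" truncation described above; smaller k values behave the same way.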
In this way, the 4 obtained historical search behavior features can represent the degree of semantic repetition between the association word 1 and the association word 2. In order to keep the historical search behavior features up to date, statistical analysis of the click log table and the exposure log table of the search engine may be initiated periodically, for example, once a month.
The second prediction layer in the embodiment of fig. 6 includes a trained second classification model, which predicts and outputs the second semantic repetition index. Assume that the second classification model is a binary classification model; taking a log-linear model as an example, the second semantic repetition index of two association words can be predicted by the second classification model. The second classification model of the embodiment of the disclosure is simple and easy to use and, while ensuring the accuracy of the prediction effect, has low computation cost and low service deployment cost. However, the present disclosure is not limited thereto, and other binary classification models may be used.
In the embodiment of fig. 6, it is assumed that the second conditional probability distribution is predicted as:

P(Y = 1 | f) = exp(W2 · f + b2) / (1 + exp(W2 · f + b2))    (25)

P(Y = 0 | f) = 1 / (1 + exp(W2 · f + b2))    (26)

where f = (f1, f2, f3, f4) is the historical search behavior overlap feature representation vector of the association word 1 and the association word 2. In the above equations (25) and (26), the values of W2 and b2 are obtained by training with samples, and

P(Y = 1 | f)

is the second conditional probability that semantic repetition exists between the association word 1 and the association word 2. The second semantic repetition index may be determined based on this second conditional probability; in some embodiments, the second semantic repetition index may be set equal to the second conditional probability.
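The log-linear prediction in equations (25) and (26) is the standard logistic model. A minimal sketch follows; the weight vector W2 and bias b2 are illustrative stand-ins for trained values:

```python
import math


def second_repetition_index(features, w2, b2):
    """Log-linear (logistic) model of eq. (25): P(semantic repetition | features).
    `features` is the (f1, f2, f3, f4) overlap vector; w2 and b2 come from training."""
    z = sum(w * f for w, f in zip(w2, features)) + b2
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z)) in eq. (25)
    return 1.0 / (1.0 + math.exp(-z))
```

The complementary probability of eq. (26) is simply one minus this value, so only one output needs to be computed at serving time.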
According to the association word de-duplication method provided by the embodiment of the disclosure, statistical analysis is performed on the historical search exposure log table and click log table of a search engine to obtain historical search behavior features of different association words, and a second classification model is then used to obtain a second semantic repetition index. The second semantic repetition index is combined with the first semantic repetition index obtained in other embodiments to judge whether semantic repetition exists, which can improve the accuracy of semantic repetition judgment.
In the embodiment of fig. 7, the associative word de-duplication method provided by the embodiment of the present disclosure is illustrated in the search scenario of a browser.
Fig. 7 schematically illustrates an interface diagram of an associative word de-duplication method according to an embodiment of the present disclosure. As shown in fig. 7, assume that a user inputs the query keyword "nucleic acid" in the search box and a series of candidate associated words are recalled. With the associative word de-duplication method provided by the embodiment of the present disclosure, candidate associated words with repeated semantics can be identified, and only the target associated words remaining after de-duplication filtering are automatically displayed under the search box, such as "nucleic acid detection method", "nucleic acid detection cost", "validity period of nucleic acid detection", "how much money is for nucleic acid detection", "how long result is for nucleic acid detection", "nucleic acid detection service", "nucleic acid result online query", "nucleic acid detection how long result" and the like. In this way, candidate associated words with repeated semantics are removed for the user in advance, the repetition rate of the target associated words displayed by the search engine of the browser is significantly reduced, and more diversified association words can be provided for the user as much as possible, helping the user find the desired information more quickly, meeting more of the user's needs, and further improving the adoption rate of the association words. For the displayed first target associated word, detailed information related to it may be further displayed between the first target associated word and the second target associated word, for example, "the nucleic acid detection method is performed by searching for a respiratory tract specimen, blood or stool … of the patient". The user may click the search button behind the search box to perform the search.
It should be noted that, although the embodiment of fig. 7 is illustrated by taking a search scenario as an example, the method provided by the present disclosure may be applied to many other scenarios, such as input methods, translation, automatic hot-topic detection, spell checking, plagiarism detection, and the like.
Fig. 8 schematically illustrates a block diagram of an apparatus for removing duplicates of an associative word according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 for removing duplicates of an associative word provided by the embodiment of the present disclosure may include a candidate associative word obtaining unit 810, a semantic association feature vector obtaining unit 820, a first semantic repeating index obtaining unit 830, a search behavior overlap feature obtaining unit 840, a second semantic repeating index obtaining unit 850, and a candidate associative word de-duplication filtering unit 860.
In this embodiment of the disclosure, the candidate associated word obtaining unit 810 may be configured to perform semantic recall on the query keyword, so as to obtain a candidate associated word of the query keyword. The semantic association feature vector obtaining unit 820 may be configured to obtain a semantic association feature representation vector between different candidate association words, where the semantic association feature representation vector represents a semantic repeating degree between different candidate association words. The first semantic repeating index obtaining unit 830 may be configured to process the semantic association feature expression vectors between different candidate association words by using the first classification model, so as to obtain a first semantic repeating index between different candidate association words. The search behavior overlap feature obtaining unit 840 may be configured to obtain a historical search behavior overlap feature representation vector between different candidate association words, where the historical search behavior overlap feature representation vector represents a degree of search behavior overlap between the different candidate association words. The second semantic repeating index obtaining unit 850 may be configured to process the historical search behavior overlapping feature representation vectors between different candidate association words by using the second classification model, so as to obtain a second semantic repeating index between different candidate association words. 
The candidate associative word de-duplication filtering unit 860 may be configured to de-duplicate and filter candidate associative words with semantic duplication in the candidate associative words according to a first semantic duplication indicator and a second semantic duplication indicator between different candidate associative words, and determine a target associative word, so as to simultaneously display the query keyword and the target associative word.
In an exemplary embodiment, the candidate associative words may include a first associative word and a second associative word. The semantic association feature vector obtaining unit 820 may include: a distance information obtaining unit, configured to obtain distance information between the first associated word and the second associated word; a common character information obtaining unit, configured to obtain common character information between the first associated word and the second associated word; a character string length information obtaining unit, configured to obtain character string length information between the first associated word and the second associated word, where the character string length information between the first associated word and the second associated word includes at least one of a word set length difference between a first word set of the first associated word and a second word set of the second associated word, a word set length ratio between the first word set and the second word set, a character string length difference between the first associated word and the second associated word, and a character string length ratio between the first associated word and the second associated word; a semantic association feature representation vector generation unit, configured to generate a semantic association feature representation vector between the first associated word and the second associated word according to distance information between the first associated word and the second associated word, common character information, and character string length information.
In an exemplary embodiment, the distance information between the first and second associative words may include a semantic distance between the first and second associative words. Wherein the distance information obtaining unit may include: a semantic similarity obtaining unit, configured to obtain a semantic similarity between the first associated word and the second associated word; a character string common prefix length obtaining unit, configured to obtain a character string common prefix length between the first associated word and the second associated word; a similarity weight obtaining unit, configured to determine a similarity weight of the common prefix length of the character string; an editing similarity obtaining unit, configured to obtain an editing similarity between the first associated word and the second associated word according to a semantic similarity between the first associated word and the second associated word, a character string common prefix length, and a similarity weight of the character string common prefix length; the semantic distance generating unit may be configured to obtain a semantic distance between the first associated word and the second associated word according to the editing similarity between the first associated word and the second associated word.
In an exemplary embodiment, the semantic similarity obtaining unit may include: a character string length obtaining unit, configured to obtain a first character string length of the first associated word and a second character string length of the second associated word; a matching character number obtaining unit, configured to obtain a number of matching characters between the first associative word and the second associative word; a character conversion number obtaining unit, configured to obtain a number of character conversions between the first associated word and the second associated word; a semantic similarity generating unit, configured to obtain a semantic similarity between the first associated word and the second associated word according to the first character string length, the second character string length, the number of matched characters, and the number of character conversion times.
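The quantities named in the two units above — string lengths, matching characters, character conversions (transpositions), and a common prefix length with a similarity weight — match the classic Jaro and Jaro-Winkler similarity measures. The sketch below assumes those standard definitions, which the disclosure's exact formulas may refine:

```python
def jaro(s1, s2):
    """Jaro similarity from string lengths, matching characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1   # matching window around each position
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # count transpositions between matched characters
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Boost the Jaro similarity by the common string prefix length, weighted by p."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)
```

The prefix weight p plays the role of the "similarity weight of the common prefix length of the character string" determined by the similarity weight obtaining unit.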
In an exemplary embodiment, the distance information between the first and second associated words may include an edit distance between the first and second associated words. The distance information obtaining unit may include: a minimum editing operation number obtaining unit, configured to obtain the minimum number of editing operations required to convert one of the first associated word and the second associated word into the other; and an editing distance generating unit, configured to obtain the edit distance between the first associated word and the second associated word according to the minimum number of editing operations.
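The minimum number of editing operations described above is the classic Levenshtein edit distance, which can be computed by dynamic programming:

```python
def edit_distance(s1, s2):
    """Minimum number of insert/delete/substitute operations turning s1 into s2,
    computed row by row over the standard (len1+1) x (len2+1) DP table."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute (free if equal)
        prev = cur
    return prev[-1]
```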
In an exemplary embodiment, the distance information between the first and second associated words may include a similar distance between the first and second associated words. The distance information obtaining unit may include: a word set obtaining unit, configured to obtain a first word set of the first associated word and a second word set of the second associated word, where the first word set includes the non-repeating words in the first associated word, and the second word set includes the non-repeating words in the second associated word; an intersection element number obtaining unit, configured to obtain the intersection element number between the first word set and the second word set; a union element number obtaining unit, configured to obtain the union element number between the first word set and the second word set; a similarity coefficient obtaining unit, configured to obtain a similarity coefficient between the first associated word and the second associated word according to the intersection element number and the union element number; and a similarity distance generating unit, configured to obtain the similar distance between the first associated word and the second associated word according to the similarity coefficient.
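The similarity coefficient built from intersection and union element numbers is a Jaccard coefficient. A minimal sketch, assuming the word sets are the non-repeating characters of each associated word and that the similar distance is derived as 1 − coefficient (an assumption; the disclosure may use a different mapping):

```python
def similar_distance(w1, w2):
    """Jaccard-style similarity coefficient over the words' character sets,
    plus a distance derived from it (1 - coefficient, an assumption)."""
    set1, set2 = set(w1), set(w2)   # non-repeating characters of each word
    inter = len(set1 & set2)        # intersection element number
    union = len(set1 | set2)        # union element number
    coeff = inter / union if union else 1.0
    return coeff, 1.0 - coeff
```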
In an exemplary embodiment, the common character information between the first and second associated words may include at least one of the longest common string prefix length, the common word proportion, the word set union, the word set intersection, and the like between the first and second associated words.
In an exemplary embodiment, when the common character information between the first and second associated words includes a common word proportion between the first and second associated words, the common character information obtaining unit may include: a word sequence obtaining unit, configured to obtain a first word sequence of the first associated word and a second word sequence of the second associated word; a first character common word obtaining unit, configured to obtain the first character common words according to the number of characters in the first word sequence that belong to the second word sequence; a second character common word obtaining unit, configured to obtain the second character common words according to the number of characters in the second word sequence that belong to the first word sequence; a common word length obtaining unit, configured to obtain a common word length according to the length of the first character common words and the length of the second character common words; a word sequence length obtaining unit, configured to obtain a word sequence length according to the length of the first word sequence and the length of the second word sequence; and a common word proportion obtaining unit, configured to obtain the common word proportion between the first associated word and the second associated word according to the common word length and the word sequence length.
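A minimal sketch of the common word proportion described above, assuming the common word length and the word sequence length are each the sum over both words (the exact aggregation is not fully specified in the text):

```python
def common_word_proportion(w1, w2):
    """Proportion of characters shared between the two word sequences.
    Summing lengths over both sides is an assumption."""
    seq1, seq2 = list(w1), list(w2)
    common1 = [c for c in seq1 if c in seq2]   # chars of w1 also found in w2
    common2 = [c for c in seq2 if c in seq1]   # chars of w2 also found in w1
    common_len = len(common1) + len(common2)   # common word length
    seq_len = len(seq1) + len(seq2)            # word sequence length
    return common_len / seq_len if seq_len else 0.0
```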
In an exemplary embodiment, the string length information between the first and second associated words may include at least one of a word set length difference between a first word set of the first associated word and a second word set of the second associated word, a word set length ratio between the first word set and the second word set, a string length difference between the first and second associated words, a string length ratio between the first and second associated words, and the like.
In an exemplary embodiment, the candidate associative words include a first associated word and a second associated word. The search behavior overlapping feature obtaining unit 840 may include: a first click exposure information obtaining unit, configured to obtain, with the first associated word as a search keyword, the first historical click web page links and their first historical click amounts and the first historical exposure web page links and their first historical exposure amounts within a predetermined time period; a second click exposure information obtaining unit, configured to obtain, with the second associated word as a search keyword, the second historical click web page links and their second historical click amounts and the second historical exposure web page links and their second historical exposure amounts within the predetermined time period; a click exposure overlap degree obtaining unit, configured to obtain, according to the above click web page links and their click amounts and exposure web page links and their exposure amounts, the clicked web page link overlap degree and the web page link click overlap degree between the first historical click web page links and the second historical click web page links, and the exposed web page link overlap degree and the web page link exposure overlap degree between the first historical exposure web page links and the second historical exposure web page links; and a search behavior overlap feature generation unit, configured to generate the historical search behavior overlap feature representation vector between different candidate association words according to these four overlap degrees.
In an exemplary embodiment, the candidate associative words may include a first associated word and a second associated word, and the candidate associative word de-duplication filtering unit 860 may include a union set unit or a weighted summation unit. The union set unit may be configured to: if the first semantic repetition index between the first associated word and the second associated word is greater than a first threshold, or the second semantic repetition index between the first associated word and the second associated word is greater than a second threshold, determine that semantic repetition exists between the first associated word and the second associated word, and select the first associated word or the second associated word as the target associated word. The weighted summation unit may be configured to: determine a target semantic repetition index between the first associated word and the second associated word according to the first semantic repetition index and the second semantic repetition index between them; and, if the target semantic repetition index is greater than a target threshold, determine that semantic repetition exists between the first associated word and the second associated word, and select the first associated word or the second associated word as the target associated word.
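The two de-duplication strategies of unit 860 — a union of per-index thresholds, and a weighted summation compared against a target threshold — can be sketched as follows. All thresholds and weights are illustrative assumptions, and the greedy keep-or-drop loop is one possible way to turn pairwise decisions into a target word list:

```python
def union_strategy(idx1, idx2, t1, t2):
    """Semantic repetition if either semantic repetition index exceeds its threshold."""
    return idx1 > t1 or idx2 > t2


def weighted_sum_strategy(idx1, idx2, w1, w2, t_target):
    """Semantic repetition if the weighted combination exceeds the target threshold."""
    return w1 * idx1 + w2 * idx2 > t_target


def dedup(candidates, pair_indices, decide):
    """Greedy de-duplication sketch: keep a candidate unless it repeats an
    already-kept one. `pair_indices[(a, b)]` gives the (idx1, idx2) pair for
    candidates a and b; missing pairs default to no repetition."""
    kept = []
    for word in candidates:
        if not any(decide(*pair_indices.get((k, word), (0.0, 0.0))) for k in kept):
            kept.append(word)
    return kept
```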
Other contents of the associational word de-duplication apparatus of the embodiments of the present disclosure may refer to the above-described embodiments.
The method for removing duplicates of associative words provided by the embodiment of the present disclosure may be implemented in combination with blockchain technology. For example, the pre-trained classification models may be stored in a distributed manner in a blockchain; the calculated distance information between the first associated word and the second associated word, the common character information, the character string length information, the semantic association feature representation vectors, the historical search behavior overlap feature representation vectors, and the like may also be stored in a distributed manner in a blockchain; and the prediction result of whether semantic repetition exists between any two candidate associated words may also be stored in a blockchain.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain, essentially a decentralized database, is a chain of data blocks linked by cryptographic methods; each data block contains the information of a batch of network transactions and is used for verifying the validity (anti-counterfeiting) of that information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, and an application service layer.
The blockchain underlying platform may include processing modules such as user management, basic service, smart contract and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including maintaining the generation of public and private keys (account management), key management, and maintaining the correspondence between users' real identities and blockchain addresses (authority management); under authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and record valid requests to storage after consensus is completed; for a new service request, the basic service first performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger after encryption (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering and contract execution; developers can define contract logic through a programming language, issue it to the blockchain (contract registration), and trigger execution through key calls or other events according to the logic of the contract terms, completing the contract logic while also providing functions for upgrading and canceling contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract setting and cloud adaptation, and for the visual output of real-time states during product operation, such as alarms, monitoring network conditions, and monitoring the health status of node devices.
The platform product service layer provides basic capability and an implementation framework of typical application, and developers can complete block chain implementation of business logic based on the basic capability and the characteristics of the superposed business. The application service layer provides the application service based on the block chain scheme for the business participants to use.
Reference is now made to fig. 9, which illustrates a schematic diagram of an electronic device suitable for use in implementing embodiments of the present application. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application. The electronic device in fig. 9 may be, for example, a server, but the present disclosure is not limited thereto.
Referring to fig. 9, an electronic device provided in an embodiment of the present disclosure may include: a processor 101, a communication interface 102, a memory 103, and a communication bus 104.
Wherein the processor 101, the communication interface 102 and the memory 103 communicate with each other via a communication bus 104.
Optionally, the communication interface 102 may be an interface of a communication module, such as an interface of a GSM (Global System for Mobile communications) module. The processor 101 is configured to execute a program, and the memory 103 is configured to store the program. The program may include computer operating instructions.
The processor 101 may be a central processing unit (CPU), an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The memory 103 may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
Specifically, the program may be used to: perform semantic recall on a query keyword to obtain candidate associated words of the query keyword; obtain semantic association feature representation vectors between different candidate associated words, where the semantic association feature representation vectors represent the degree of semantic repetition between different candidate associated words; process the semantic association feature representation vectors between different candidate associated words by using a first classification model to obtain a first semantic repetition index between different candidate associated words; obtain historical search behavior overlap feature representation vectors between different candidate associated words, where the historical search behavior overlap feature representation vectors represent the degree of search behavior overlap between different candidate associated words; process the historical search behavior overlap feature representation vectors between different candidate associated words by using a second classification model to obtain a second semantic repetition index between different candidate associated words; and, according to the first semantic repetition index and the second semantic repetition index between different candidate associated words, de-duplicate and filter the candidate associated words with repeated semantics, determine the target associated words, and display the query keyword and the target associated words at the same time.
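The program steps above can be sketched end to end. Every callable below is an assumed interface standing in for the corresponding component (semantic recall, the two feature extractors, the two classification models, and the decision rule), not the disclosure's actual API:

```python
def deduplicate_associations(query, recall, sem_features, model1,
                             beh_features, model2, decide):
    """End-to-end sketch of the claimed method: recall candidates, score each
    new candidate against already-kept ones with both classification models,
    and keep it only if no pairwise decision flags semantic repetition."""
    candidates = recall(query)                          # semantic recall
    kept = []
    for cand in candidates:
        repeated = False
        for target in kept:
            idx1 = model1(sem_features(target, cand))   # first semantic repetition index
            idx2 = model2(beh_features(target, cand))   # second semantic repetition index
            if decide(idx1, idx2):
                repeated = True
                break
        if not repeated:
            kept.append(cand)
    return kept                                         # target associated words
```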
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method for de-duplicating an associative word, comprising:
carrying out semantic recall on the query keywords to obtain candidate associated words of the query keywords;
obtaining semantic association feature expression vectors among different candidate association words, wherein the semantic association feature expression vectors represent semantic repeating degrees among the different candidate association words;
processing semantic association feature expression vectors among different candidate association words by using a first classification model to obtain a first semantic repeating index among the different candidate association words;
obtaining a historical search behavior overlapping characteristic representation vector between different candidate association words, wherein the historical search behavior overlapping characteristic representation vector represents the degree of search behavior overlapping between the different candidate association words;
processing the historical search behavior overlapping feature representation vectors among different candidate association words by using a second classification model to obtain a second semantic repeating index among the different candidate association words;
according to a first semantic repetition index and a second semantic repetition index between different candidate association words, candidate association words with repeated semantics in the candidate association words are subjected to de-duplication filtering, a target association word is determined, and the query keyword and the target association word are displayed at the same time.
2. The method of claim 1, wherein the candidate associative words comprise a first associative word and a second associative word; obtaining semantic association feature expression vectors among different candidate association words comprises the following steps:
obtaining distance information between the first associated word and the second associated word;
obtaining common character information between the first associated word and the second associated word;
obtaining string length information between the first and second associated words, wherein the string length information between the first and second associated words includes at least one of a word set length difference between a first word set of the first associated word and a second word set of the second associated word, a word set length ratio between the first and second word sets, a string length difference between the first and second associated words, and a string length ratio between the first and second associated words;
and generating a semantic association feature expression vector between the first associated word and the second associated word according to the distance information between the first associated word and the second associated word, the common character information and the character string length information.
3. The method of claim 2, wherein the distance information between the first and second associated words comprises a semantic distance between the first and second associated words; obtaining distance information between the first associated word and the second associated word comprises:
obtaining semantic similarity between the first associated word and the second associated word;
obtaining a common prefix length of a character string between the first associated word and the second associated word;
determining similarity weight of the common prefix length of the character string;
obtaining editing similarity between the first association word and the second association word according to semantic similarity between the first association word and the second association word, the common prefix length of the character string and the similarity weight of the common prefix length of the character string;
and obtaining the semantic distance between the first associated word and the second associated word according to the editing similarity between the first associated word and the second associated word.
4. The method of claim 3, wherein obtaining the semantic similarity between the first associated word and the second associated word comprises:
obtaining a first character string length of the first associated word and a second character string length of the second associated word;
obtaining the number of matched characters between the first associated word and the second associated word;
obtaining the number of character conversion times between the first associated word and the second associated word;
and obtaining the semantic similarity between the first associative word and the second associative word according to the first character string length, the second character string length, the number of the matched characters and the number of character conversion times.
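Claims 3 and 4 read like a Jaro-Winkler-style computation: a base similarity built from the two string lengths, the matched-character count, and the transposition ("character conversion") count, then boosted by the common-prefix length weighted by a similarity weight, with the semantic distance derived from the result. The sketch below is that standard algorithm under this reading; it is an assumption, not necessarily the patented formula.

```python
def jaro(s1: str, s2: str) -> float:
    """Base similarity from string lengths, matched characters, transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1  # match window half-width
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear out of order.
    t, k = 0, 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Edit similarity: base similarity boosted by the weighted common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:  # Winkler caps the prefix at 4 characters
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)

def semantic_distance(s1: str, s2: str) -> float:
    """Semantic distance derived from the edit similarity (claim 3)."""
    return 1.0 - jaro_winkler(s1, s2)
```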
5. The method according to any one of claims 2 to 4, wherein the distance information between the first associated word and the second associated word comprises an edit distance between the first associated word and the second associated word; and obtaining the distance information between the first associated word and the second associated word comprises:
obtaining the minimum number of editing operations required for converting the first associative word and the second associative word from one to the other;
and obtaining the editing distance between the first associative word and the second associative word according to the minimum number of editing operations.
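The "minimum number of editing operations required to convert one word into the other" in claim 5 is the classic Levenshtein edit distance; a standard dynamic-programming sketch:

```python
def edit_distance(s1: str, s2: str) -> int:
    # prev[j] holds the distance between s1[:i-1] and s2[:j]; we keep only
    # one previous row, so memory is O(len(s2)).
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from s1
                cur[j - 1] + 1,            # insert a character into s1
                prev[j - 1] + (c1 != c2),  # substitute (free on a match)
            ))
        prev = cur
    return prev[-1]
```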
6. The method according to any one of claims 2 to 4, wherein the distance information between the first associated word and the second associated word comprises a similar distance between the first associated word and the second associated word; and obtaining the distance information between the first associated word and the second associated word comprises:
obtaining a first character set of the first associated word and a second character set of the second associated word, wherein the first character set comprises non-repeated characters in the first associated word, and the second character set comprises non-repeated characters in the second associated word;
obtaining the number of intersection elements between the first word set and the second word set;
acquiring the number of union elements between the first word set and the second word set;
obtaining a similarity coefficient between the first associative word and the second associative word according to the intersection element number and the union element number between the first word set and the second word set;
and obtaining the similar distance between the first associated word and the second associated word according to the similar coefficient between the first associated word and the second associated word.
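Claim 6 describes a Jaccard-style similarity coefficient over the sets of distinct characters, with a "similar distance" derived from it. A sketch of that reading — the exact distance transform is an assumption, here simply one minus the coefficient:

```python
def jaccard_similarity(w1: str, w2: str) -> float:
    s1, s2 = set(w1), set(w2)  # non-repeated characters of each word
    inter = len(s1 & s2)       # number of intersection elements
    union = len(s1 | s2)       # number of union elements
    return inter / union if union else 1.0

def similar_distance(w1: str, w2: str) -> float:
    return 1.0 - jaccard_similarity(w1, w2)
```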
7. The method of claim 2, wherein the common character information between the first and second associated words comprises at least one of a longest string common prefix length, a common word proportion, a word set union set, and a word set intersection set between the first and second associated words; wherein, when the common character information between the first associated word and the second associated word includes the common word ratio between the first associated word and the second associated word, obtaining the common character information between the first associated word and the second associated word includes:
obtaining a first word sequence of the first associative word and a second word sequence of the second associative word;
obtaining a first common word according to the characters in the first word sequence that belong to the second word sequence;
obtaining a second common word according to the characters in the second word sequence that belong to the first word sequence;
obtaining the length of the common word according to the length of the first character common word and the length of the second character common word;
obtaining the length of the word sequence according to the length of the first word sequence and the length of the second word sequence;
and obtaining the public word proportion between the first associative word and the second associative word according to the public word length and the word sequence length.
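Claim 7's common word proportion reads as: count the characters of each word that also occur in the other, and divide the combined common count by the combined sequence length. A sketch of that reading (assumed, not the patented formula):

```python
def common_word_ratio(w1: str, w2: str) -> float:
    seq1, seq2 = list(w1), list(w2)                # first and second word sequences
    common1 = sum(1 for ch in seq1 if ch in seq2)  # chars of w1 found in w2
    common2 = sum(1 for ch in seq2 if ch in seq1)  # chars of w2 found in w1
    common_len = common1 + common2                 # common word length
    total_len = len(seq1) + len(seq2)              # word sequence length
    return common_len / total_len if total_len else 0.0
```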
8. The method of claim 1, wherein the candidate associative words comprise a first associative word and a second associative word; obtaining the overlapping characteristic expression vector of the historical search behaviors among different candidate association words comprises the following steps:
taking the first associated word as a search keyword, and acquiring a first history click webpage link and a first history click quantity thereof, and a first history exposure webpage link and a first history exposure amount thereof in a preset time period;
taking the second associated word as a search keyword, and acquiring a second history click webpage link and a second history click quantity thereof, and a second history exposure webpage link and a second history exposure amount thereof in the preset time period;
acquiring a click webpage link overlapping degree and a webpage link click overlapping degree between the first history click webpage link and the second history click webpage link, and an exposure webpage link overlapping degree and a webpage link exposure overlapping degree between the first history exposure webpage link and the second history exposure webpage link, according to the first history click webpage link and the first history click quantity thereof, the first history exposure webpage link and the first history exposure amount thereof, the second history click webpage link and the second history click quantity thereof, and the second history exposure webpage link and the second history exposure amount thereof;
and generating historical search behavior overlapping feature expression vectors among different candidate association words according to the clicked webpage link overlapping degree and the webpage link clicking overlapping degree between the first historical clicked webpage link and the second historical clicked webpage link, and the exposed webpage link overlapping degree and the webpage link exposing overlapping degree between the first historical exposed webpage link and the second historical exposed webpage link.
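A sketch of claim 8's history-overlap features: issue each associated word as a query over the same period, collect per-link click and exposure counts, then compute both a set-level link overlap (which links are shared) and a count-weighted overlap (how much traffic is shared). The function and field names are illustrative assumptions, not from the patent.

```python
def overlap_features(clicks1: dict, clicks2: dict,
                     exposures1: dict, exposures2: dict) -> list:
    """Each argument maps a webpage link to its click/exposure count."""
    def link_overlap(a: dict, b: dict) -> float:
        # Fraction of distinct links the two result sets share.
        union = set(a) | set(b)
        return len(set(a) & set(b)) / len(union) if union else 0.0

    def count_overlap(a: dict, b: dict) -> float:
        # Count-weighted overlap: shared traffic over total traffic.
        shared = sum(min(a[k], b[k]) for k in set(a) & set(b))
        total = sum(a.values()) + sum(b.values())
        return 2 * shared / total if total else 0.0

    return [
        link_overlap(clicks1, clicks2),        # click webpage link overlapping degree
        count_overlap(clicks1, clicks2),       # webpage link click overlapping degree
        link_overlap(exposures1, exposures2),  # exposure webpage link overlapping degree
        count_overlap(exposures1, exposures2), # webpage link exposure overlapping degree
    ]
```

The resulting vector is what claim 1's second classification model would consume to score behavioral overlap.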
9. The method of claim 1, wherein the candidate associative words comprise a first associative word and a second associative word; and performing de-duplication filtering on candidate association words with repeated semantics according to a first semantic repetition index and a second semantic repetition index between different candidate association words to determine the target association word comprises:
if a first semantic repetition index between the first associated word and the second associated word is greater than a first threshold value, or the second semantic repetition index is greater than a second threshold value, determining that semantic repetition exists between the first associated word and the second associated word, and selecting the first associated word or the second associated word as the target associated word; alternatively,
determining a target semantic repetition index between the first associated word and the second associated word according to a first semantic repetition index and a second semantic repetition index between the first associated word and the second associated word;
if the target semantic repetition index between the first associated word and the second associated word is larger than a target threshold value, judging that semantic repetition exists between the first associated word and the second associated word, and selecting the first associated word or the second associated word as the target associated word.
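Claim 9's two decision rules can be sketched as follows: either index alone exceeding its own threshold marks a duplicate pair, or a combined target index is compared against a single target threshold. The combination rule (a simple average) and the default thresholds are assumptions for illustration.

```python
def is_semantic_duplicate(idx1: float, idx2: float,
                          t1: float = 0.8, t2: float = 0.8,
                          target=None) -> bool:
    """Return True when the word pair is judged semantically repeated."""
    if target is not None:
        # Rule 2: combined target index vs. a single target threshold.
        combined = (idx1 + idx2) / 2  # combination rule assumed
        return combined > target
    # Rule 1: either index alone exceeding its threshold suffices.
    return idx1 > t1 or idx2 > t2
```

When a pair is flagged, one word of the pair is kept as the target associated word and the other is filtered out before display.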
10. An apparatus for removing duplicate association words, comprising:
the candidate associated word obtaining unit is used for performing semantic recall on a query keyword to obtain candidate associated words of the query keyword;
the semantic association feature vector obtaining unit is used for obtaining semantic association feature expression vectors among different candidate association words, and the semantic association feature expression vectors express semantic repeating degrees among the different candidate association words;
the first semantic repeating index obtaining unit is used for processing semantic association feature expression vectors among different candidate association words by using a first classification model to obtain first semantic repeating indexes among the different candidate association words;
the search behavior overlap feature obtaining unit is used for obtaining historical search behavior overlap feature expression vectors among different candidate association words, and the historical search behavior overlap feature expression vectors represent the search behavior overlap degree among the different candidate association words;
the second semantic repeating index obtaining unit is used for processing the historical search behavior overlapping feature expression vectors among different candidate association words by using a second classification model to obtain second semantic repeating indexes among the different candidate association words;
and the candidate associated word duplicate removal filtering unit is used for removing duplicate of candidate associated words with repeated semantics in the candidate associated words according to a first semantic duplicate index and a second semantic duplicate index between different candidate associated words, determining target associated words and simultaneously displaying the query keyword and the target associated words.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 9.
12. An electronic device, comprising:
at least one processor;
a storage device configured to store at least one program that, when executed by the at least one processor, causes the at least one processor to implement the method of any one of claims 1 to 9.
CN202110368415.4A 2021-04-06 2021-04-06 Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment Active CN112765966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110368415.4A CN112765966B (en) 2021-04-06 2021-04-06 Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN112765966A true CN112765966A (en) 2021-05-07
CN112765966B CN112765966B (en) 2021-07-23

Family

ID=75691152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110368415.4A Active CN112765966B (en) 2021-04-06 2021-04-06 Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112765966B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130013591A1 (en) * 2011-07-08 2013-01-10 Microsoft Corporation Image re-rank based on image annotations
US20130297545A1 (en) * 2012-05-04 2013-11-07 Pearl.com LLC Method and apparatus for identifying customer service and duplicate questions in an online consultation system
CN107958078A (en) * 2017-12-13 2018-04-24 北京百度网讯科技有限公司 Information generating method and device
CN109189990A (en) * 2018-07-25 2019-01-11 北京奇艺世纪科技有限公司 A kind of generation method of search term, device and electronic equipment
CN110377817A (en) * 2019-06-13 2019-10-25 百度在线网络技术(北京)有限公司 Search entry method for digging and device and its application in multimedia resource
CN111125344A (en) * 2019-12-23 2020-05-08 北大方正集团有限公司 Related word recommendation method and device
CN111897926A (en) * 2020-08-04 2020-11-06 广西财经学院 Chinese query expansion method integrating deep learning and expansion word mining intersection
CN112328889A (en) * 2020-11-23 2021-02-05 北京字节跳动网络技术有限公司 Method and device for determining recommended search terms, readable medium and electronic equipment


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115314737A (en) * 2021-05-06 2022-11-08 青岛聚看云科技有限公司 Content display method, display equipment and server
CN113407965A (en) * 2021-06-17 2021-09-17 海南海锐众创科技有限公司 Deposit certificate document encryption system
CN113407965B (en) * 2021-06-17 2022-04-22 海南海锐众创科技有限公司 Deposit certificate document encryption system

Also Published As

Publication number Publication date
CN112765966B (en) 2021-07-23

Similar Documents

Publication Publication Date Title
Chen et al. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection
US11023505B2 (en) Method and apparatus for pushing information
CN111831802B (en) Urban domain knowledge detection system and method based on LDA topic model
WO2019043379A1 (en) Fact checking
Win et al. Target oriented tweets monitoring system during natural disasters
CN109271514B (en) Generation method, classification method, device and storage medium of short text classification model
WO2014160282A1 (en) Classifying resources using a deep network
Riadi Detection of cyberbullying on social media using data mining techniques
CN112765966B (en) Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
Hsu et al. Integrating machine learning and open data into social Chatbot for filtering information rumor
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN110956021A (en) Original article generation method, device, system and server
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
Song et al. Improving neural named entity recognition with gazetteers
Mahata et al. From chirps to whistles: discovering event-specific informative content from twitter
CN113515589A (en) Data recommendation method, device, equipment and medium
Zhu et al. CCBLA: a lightweight phishing detection model based on CNN, BiLSTM, and attention mechanism
WO2015084757A1 (en) Systems and methods for processing data stored in a database
US20220366295A1 (en) Pre-search content recommendations
EP3635575A1 (en) Sibling search queries
CN113010771A (en) Training method and device for personalized semantic vector model in search engine
WO2023048807A1 (en) Hierarchical representation learning of user interest
Jain et al. Review on analysis of classifiers for fake news detection
Gupta et al. Document summarisation based on sentence ranking using vector space model
Liu et al. A Graph Convolutional Network‐Based Sensitive Information Detection Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40044608

Country of ref document: HK