CN115840845A - Webpage retrieval method and related equipment - Google Patents

Publication number
CN115840845A
Authority
CN
China
Prior art keywords
text
semantic
target
text blocks
webpage
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111101017.2A
Other languages
Chinese (zh)
Inventor
胡兰
陈忠富
马中瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202111101017.2A (CN115840845A)
Priority to PCT/CN2022/118346 (WO2023040808A1)
Publication of CN115840845A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this application disclose a webpage retrieval method and related equipment, applied to semantic search in the field of artificial intelligence and used to reduce resource overhead in the retrieval process. The method comprises the following steps: in the retrieval index construction stage, a plurality of webpage texts are first segmented sentence by sentence to obtain a plurality of text blocks; a plurality of target text blocks are then screened out from the plurality of text blocks according to feature indexes of the text blocks, where the feature indexes indicate association features among the text blocks, or the association features together with the respective semantic features of the plurality of text blocks, and the association features indicate the association between the semantics of at least two text blocks; and the plurality of target text blocks are stored. Because only the target text blocks screened from the plurality of text blocks serve as retrieval indexes, the number of retrieval indexes is greatly reduced and the resource overhead of the retrieval process is saved.

Description

Webpage retrieval method and related equipment
Technical Field
The application relates to the field of semantic retrieval in artificial intelligence, and in particular to a webpage retrieval method and related equipment.
Background
Retrieval is one of the key technologies in the internet field. Semantic retrieval is an information retrieval process used by modern search engines to return the results most relevant to the query statement entered by the user. Semantic retrieval focuses on the intent behind the query statement, so its results are more accurate than those of traditional keyword matching.
The current semantic retrieval method first segments each webpage text in a webpage text library at a fixed character length to obtain a plurality of text blocks for each webpage text. It then uses a deep representation model to compute a semantic vector for each text block and for the query statement, takes the similarity between the semantic vector of each text block and the semantic vector of the query statement as the measure of how well the text block matches the query statement, and returns the webpage text containing the best-matching text block as the retrieval result.
Because the current method segments each webpage text at a fixed character length, semantic information is easily truncated, which causes problems such as text blocks that express incomplete information and damaged context information in the webpage text.
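The truncation problem can be seen in a small sketch. The function names, the sample text, and the chunk size of 30 are illustrative assumptions, and the regex-based sentence splitter is a naive stand-in rather than the segmentation used in the application.

```python
import re

def split_fixed(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def split_sentences(text: str) -> list[str]:
    """Split after sentence-ending punctuation (ASCII or CJK), keeping the mark.

    Naive on purpose: abbreviations and decimals such as "3.14" are also split.
    """
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]

text = ("Semantic retrieval ranks pages by meaning. "
        "Fixed-length cuts can break a sentence in half.")

fixed_blocks = split_fixed(text, 30)     # first block stops mid-sentence
sentence_blocks = split_sentences(text)  # each block is one whole sentence
```

With a fixed length, the first block ends partway through the first sentence, whereas sentence-level splitting keeps each block semantically complete.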
Disclosure of Invention
The embodiment of the application provides a webpage retrieval method and related equipment, which are used for saving resource overhead in a retrieval process.
In a first aspect, an embodiment of the present application provides a webpage retrieval method applied to an electronic device (e.g., a server). The method may include the following. First, in the retrieval index construction stage, the electronic device segments each of a plurality of webpage texts sentence by sentence to obtain a plurality of text blocks, where each webpage text corresponds to a plurality of text blocks. Then, the electronic device screens out a plurality of target text blocks from the plurality of text blocks according to feature indexes of the text blocks, where each target text block serves as a retrieval index, the feature indexes indicate association features among the plurality of text blocks, or the association features together with the semantic features of the text blocks in the webpage text, and the association features indicate the association between the semantics of at least two text blocks. Finally, the electronic device stores the target text blocks, which are used for retrieving the webpage texts. In this embodiment, segmenting each webpage text sentence by sentence keeps the semantics expressed by each text block complete, and using only the target text blocks screened out according to the feature indexes as retrieval indexes greatly reduces the number of retrieval indexes. This saves resource overhead in the retrieval process while preserving the semantic completeness of the target text blocks, which improves retrieval quality.
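The three steps of the first aspect (segment, screen, store) can be sketched as follows. This is a minimal illustration, not the claimed implementation: `build_index`, the `score` callable standing in for the feature indexes, and `keep_ratio` are all hypothetical names and parameters.

```python
import re
from typing import Callable

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter (assumption): break after ., !, ? or CJK marks."""
    return [p for p in re.split(r"(?<=[.!?。！？])\s*", text) if p]

def build_index(
    pages: dict[str, str],
    score: Callable[[str], float],   # hypothetical stand-in for the feature indexes
    keep_ratio: float = 0.5,         # hypothetical fraction of blocks to keep
) -> list[tuple[str, int, str]]:
    """Return the stored retrieval index as (page_id, position, block_text)."""
    index = []
    for page_id, text in pages.items():
        blocks = split_sentences(text)                    # step 1: segment
        keep = max(1, round(len(blocks) * keep_ratio))
        best = sorted(range(len(blocks)),
                      key=lambda i: score(blocks[i]),
                      reverse=True)[:keep]                # step 2: screen
        for i in sorted(best):                            # keep page order
            index.append((page_id, i, blocks[i]))         # step 3: store
    return index
```

Keeping the block position alongside the text is a design choice made here so that a later matching stage can reason about where each target text block sits in its page.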
In an optional implementation manner, in a webpage matching stage, the electronic device receives a query statement input by a user; and matching the first semantic vector of the query statement with a plurality of second semantic vectors to output a target webpage text matched with the query statement, wherein the plurality of second semantic vectors are semantic vectors corresponding to a plurality of target text blocks screened from the plurality of webpage texts.
In an alternative implementation, matching the first semantic vector with a plurality of second semantic vectors to output the target webpage text matched with the query statement may specifically include: the electronic device determines a plurality of first semantic similarities, where each first semantic similarity indicates the similarity between the first semantic vector of the query statement and one of the plurality of second semantic vectors, and the plurality of first semantic similarities correspond to different second semantic vectors; the electronic device then outputs a target webpage text according to at least one ranking order and at least one relative position, where each ranking order indicates the rank of the first semantic similarity corresponding to one of the plurality of target text blocks, and each relative position indicates the distance between the positions of at least two of the plurality of target text blocks in one webpage text. In this embodiment, when the electronic device matches webpage texts, it considers both the similarity between the target text blocks and the query statement and the relative positions of at least two target text blocks in the same webpage text, where the size of the relative position indicates whether the semantics expressed by the at least two target text blocks are compact (or concentrated). It can be understood that the more similar the target text blocks are to the semantics of the query statement, and the closer together at least two target text blocks are in the webpage text, the closer the topic expressed by the webpage text (or by part of its content) is likely to be to the semantics of the query statement, the better the webpage text is likely to meet the user's need, and the higher the retrieval quality.
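One way to combine ranking order and relative position is sketched below. The scoring formula is an assumption made for illustration (reciprocal rank plus a proximity bonus for pairs of hits on the same page); the text above does not fix a formula at this point, and `rank_pages` and `top_k` are hypothetical names.

```python
import math
from collections import defaultdict
from itertools import combinations

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_pages(query_vec, index, top_k=3):
    """index: list of (page_id, position, block_vector). Returns page ids, best first."""
    # First semantic similarities decide the ranking order of the blocks.
    top = sorted(index, key=lambda e: cosine(query_vec, e[2]), reverse=True)[:top_k]
    hits = defaultdict(list)
    for rank, (page_id, pos, _) in enumerate(top, start=1):
        hits[page_id].append((rank, pos))
    scores = {}
    for page_id, rp in hits.items():
        rank_score = sum(1.0 / r for r, _ in rp)      # better rank, higher score
        # Assumed proximity bonus: hits that sit close together add more.
        proximity = sum(1.0 / (1 + abs(p1 - p2))
                        for (_, p1), (_, p2) in combinations(rp, 2))
        scores[page_id] = rank_score + proximity
    return sorted(scores, key=scores.get, reverse=True)
```

A page with a single hit receives no proximity bonus, so two nearby hits on one page outrank two equally ranked hits that are far apart.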
In an optional implementation manner, outputting the target webpage text according to at least one ranking order and at least one relative position may specifically include: outputting the target webpage text based on a plurality of fourth semantic similarities, each fourth semantic similarity indicating the similarity between the first semantic vector and one of a plurality of third semantic vectors, where each third semantic vector is the semantic vector of a corresponding one of the plurality of webpage texts and is related to the semantic vectors of the plurality of target text blocks in that webpage text, the positions of those target text blocks in that webpage text, and the ranking orders of those target text blocks. In this embodiment, the third semantic vector of a webpage text is determined from the second semantic vectors, ranking orders, and position information of its target text blocks, and the semantic similarity between the query statement and the webpage text is determined at the webpage text level to determine the target webpage text, which improves retrieval quality.
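A page-level (third) semantic vector could, for example, be a weighted average of the block vectors. The weighting below is purely an assumption for illustration: a block's weight grows with its rank and shrinks with its distance from the best-ranked block of the page. `page_vector` is a hypothetical name.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def page_vector(blocks: list[tuple[list[float], int, int]]) -> list[float]:
    """blocks: (second_semantic_vector, position, rank) per target block of one page.

    Assumed weighting: w = (1 / rank) / (1 + distance to the best-ranked block).
    """
    anchor = min(blocks, key=lambda b: b[2])[1]   # position of the best-ranked block
    total = [0.0] * len(blocks[0][0])
    weight_sum = 0.0
    for vec, pos, rank in blocks:
        w = (1.0 / rank) / (1.0 + abs(pos - anchor))
        weight_sum += w
        for i, x in enumerate(vec):
            total[i] += w * x
    return [t / weight_sum for t in total]

# Fourth semantic similarity: query vector against the page-level vector.
query = [1.0, 0.0]
third = page_vector([([1.0, 0.0], 0, 1), ([0.0, 1.0], 1, 2)])
fourth_similarity = cosine(query, third)
```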
In an alternative implementation, the association features include a repetition degree, which indicates the degree to which two text blocks having an association relationship repeat each other in expressing the topic of the webpage text. In this embodiment, the electronic device screens out a plurality of target text blocks from the plurality of text blocks according to the repetition degree, which reduces the retrieval index redundancy caused by at least two text blocks expressing one topic and thus reduces the number of retrieval indexes.
In an alternative implementation, the semantic features include an importance, which indicates how important a text block is in expressing the topic of the webpage text. Screening out a plurality of target text blocks from the plurality of text blocks according to the feature indexes may include: the electronic device screens out the plurality of target text blocks from the plurality of text blocks according to the repetition degree and the importance. This reduces the retrieval index redundancy caused by at least two text blocks expressing one topic, and thus the number of retrieval indexes, without losing important information in each webpage text.
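Screening by repetition degree and importance can be sketched as a greedy pass: visit blocks from most to least important and keep a block only if it does not repeat one already kept. The greedy strategy, the threshold `max_repetition`, and the word-overlap `jaccard` measure used as a stand-in repetition degree are all assumptions for illustration.

```python
from typing import Callable

def jaccard(a: str, b: str) -> float:
    """Stand-in repetition degree (assumption): word-set overlap of two blocks."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def screen_blocks(
    blocks: list[str],
    importance: Callable[[str], float],
    repetition: Callable[[str, str], float] = jaccard,
    max_repetition: float = 0.7,     # hypothetical threshold
) -> list[str]:
    """Greedily keep important blocks that do not repeat an already kept block."""
    kept: list[str] = []
    for block in sorted(blocks, key=importance, reverse=True):
        if all(repetition(block, k) < max_repetition for k in kept):
            kept.append(block)
    return kept
```

Because the more important block is visited first, the less important member of any redundant pair is the one that gets dropped.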
In an alternative implementation, the repetition degree is related to a second semantic similarity and mutual information between the two text blocks, the second semantic similarity indicates a similarity degree of semantics of the two text blocks having an association relationship, and the mutual information indicates a degree of interdependency between the two text blocks. In this embodiment, the higher the semantic similarity between two text blocks, the more relevant the semantic level between the two text blocks. The higher the mutual information between two text blocks, the higher the degree of dependency of the two text blocks on each other, and the lower the mutual information, the higher the degree of independence of the two text blocks from each other. The repetition degree is measured from two levels through the semantic similarity and mutual information between two text blocks, wherein the semantic similarity relates to the semantic level, the mutual information relates to the association degree level of the two text blocks, and the mutual information can find the common occurrence condition of words from the statistical angle, so that whether semantic correlation or topic correlation exists between the two text blocks is analyzed, the accuracy of measuring the repetition degree of the two text blocks is improved, and the quality of constructing a retrieval index is improved.
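The two-level measurement can be sketched as follows. The text does not give a formula, so this example makes two labelled substitutions: positive pointwise mutual information (PPMI) over block-level word co-occurrence stands in for the mutual information, and a weighted sum with a hypothetical weight `alpha` combines it with the second semantic similarity.

```python
import math
from itertools import product

def ppmi(a: set[str], b: set[str], corpus: list[set[str]]) -> float:
    """Mean positive PMI over word pairs (one word from each block).

    Probabilities are the fraction of corpus blocks containing the word(s).
    """
    n = len(corpus)
    if n == 0 or not a or not b:
        return 0.0

    def p(*words: str) -> float:
        return sum(all(w in doc for w in words) for doc in corpus) / n

    scores = []
    for w1, w2 in product(a, b):
        joint = p(w1, w2)
        if joint == 0.0:              # the words never co-occur
            scores.append(0.0)
            continue
        scores.append(max(0.0, math.log2(joint / (p(w1) * p(w2)))))
    return sum(scores) / len(scores)

def repetition_degree(semantic_sim: float, mi: float, alpha: float = 0.5) -> float:
    """Assumed combination: weighted sum, with mi squashed into [0, 1)."""
    return alpha * semantic_sim + (1 - alpha) * (mi / (1 + mi))

corpus = [{"cat", "pet"}, {"cat", "pet"}, {"car", "road"}, {"car", "road"}]
related = ppmi({"cat"}, {"pet"}, corpus)      # words that always co-occur
unrelated = ppmi({"cat"}, {"road"}, corpus)   # words that never co-occur
```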
In an optional implementation manner, the semantic features further include an existence probability, which indicates the degree to which the semantics expressed by the text block are repeated in the webpage text to which it belongs: the higher the existence probability, the lower the degree of repetition. The method further includes: acquiring hot sentences, where a hot sentence is a retrieval sentence whose retrieval frequency is higher than a preset value. The importance of a text block is related to the retrieval frequency of the hot sentences, the third semantic similarity between the text block and the hot sentences, and the existence probability of the text block. In this embodiment, a hot sentence is a historical retrieval sentence frequently used by users. Users are more likely to click a webpage found through such a sentence, which suggests that the historical retrieval sentence may itself be a keyword of the webpage text or a sentence carrying the keyword. If a text block is highly similar to a hot sentence, it is also highly similar to a key sentence (keyword) in the webpage text; and the higher the existence probability of the text block (that is, the less its semantics are repeated in the webpage text), the more important the text block is for expressing the topic of the webpage text. Determining the importance of a text block from several angles makes the determination more effective.
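The three inputs to importance can be combined in many ways; the sketch below shows one assumed combination, not the formula of the application. `hot_sentences`, `importance`, and the product form (existence probability times the best frequency-weighted similarity) are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hot_sentences(query_log: dict[str, int], preset: int) -> dict[str, int]:
    """Keep retrieval sentences whose retrieval frequency exceeds the preset value."""
    return {q: f for q, f in query_log.items() if f > preset}

def importance(
    block_vec: list[float],
    existence_prob: float,
    hot: list[tuple[list[float], int]],   # (hot sentence vector, retrieval frequency)
) -> float:
    """Assumed form: existence probability x best frequency-weighted similarity."""
    if not hot:
        return 0.0
    max_freq = max(f for _, f in hot)
    best = max((f / max_freq) * cosine(block_vec, v) for v, f in hot)
    return existence_prob * max(best, 0.0)
```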
In an optional implementation manner, the association relationship includes at least one of the following: the two text blocks share the same anchor text or entity, the two text blocks have overlapping text information, or the two text blocks have a context structure.
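A check for the association relationship might look like the sketch below. The `Block` type, the `min_overlap` threshold, and the reading of "context structure" as adjacency within the same page are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Block:
    page_id: str
    position: int                      # sentence index within the page
    text: str
    entities: set[str] = field(default_factory=set)  # anchor texts or entities

def associated(a: Block, b: Block, min_overlap: int = 3) -> bool:
    """True if any of the three (assumed) association conditions holds."""
    shares_entity = bool(a.entities & b.entities)
    shared_words = set(a.text.lower().split()) & set(b.text.lower().split())
    overlaps = len(shared_words) >= min_overlap
    # Assumption: "context structure" is read as adjacent sentences of one page.
    adjacent = a.page_id == b.page_id and abs(a.position - b.position) == 1
    return shares_entity or overlaps or adjacent
```

Only pairs that pass such a check would need a repetition degree computed, which keeps the pairwise comparison from growing with every possible pair of blocks.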
In a second aspect, an embodiment of the present application provides a webpage retrieval method applied to an electronic device. The method may include: receiving a query statement; determining a plurality of semantic similarities, where each semantic similarity indicates the similarity between a first semantic vector and one of a plurality of second semantic vectors, and the plurality of semantic similarities correspond to different second semantic vectors, the first semantic vector being the semantic vector of the query statement and the plurality of second semantic vectors being the semantic vectors of a plurality of target text blocks obtained by segmenting a plurality of webpage texts sentence by sentence; and then outputting, by the electronic device, a target webpage text according to at least one ranking order and at least one relative position, where each ranking order indicates the rank of the semantic similarity corresponding to one of the target text blocks, and each relative position indicates the distance between the positions of at least two of the target text blocks in one webpage text. In this embodiment, when the electronic device matches webpage texts, it considers both the similarity between the target text blocks and the query statement and the relative positions of at least two target text blocks in the same webpage text, where the size of the relative position indicates whether the semantics expressed by the at least two target text blocks are compact (or concentrated). It can be understood that the more similar the target text blocks are to the semantics of the query statement, and the closer together they are in the webpage text, the closer the topic expressed by the webpage text (or by part of its content) is likely to be to the semantics of the query statement, and the better the webpage text is likely to meet the user's need, thereby improving retrieval quality.
In an alternative implementation, outputting at least one target webpage text according to at least one ranking order and at least one relative position comprises: outputting the target webpage text based on a plurality of fourth semantic similarities, each fourth semantic similarity indicating the similarity between the first semantic vector and one of a plurality of third semantic vectors, where each third semantic vector is the semantic vector of a corresponding one of the plurality of webpage texts and is related to the semantic vectors of the plurality of target text blocks in that webpage text, the positions of those target text blocks in that webpage text, and the ranking orders of those target text blocks. In this embodiment, the third semantic vector of a webpage text is determined from the second semantic vectors, ranking orders, and position information of its target text blocks, and the semantic similarity between the query statement and the webpage text is determined at the webpage text level to determine the target webpage text, which improves retrieval quality.
In an optional implementation manner, the method may further include: segmenting a plurality of webpage texts by taking sentences as units to obtain a plurality of text blocks; screening a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks, wherein the feature indexes are used for indicating correlation features between the text blocks or the correlation features and respective semantic features of the plurality of text blocks, and the correlation features are used for indicating correlation between semantics of at least two text blocks; and storing a plurality of target text blocks, wherein the target text blocks are used for searching a plurality of webpage texts. In the embodiment, the electronic device firstly performs the segmentation of each webpage text with the sentence as the unit to ensure the completeness of the expression semantics of the text block, screens out the target text block from the plurality of text blocks according to the characteristic index of the text block, and uses the target text block as the retrieval index, so that the number of the retrieval indexes is greatly reduced, the resource overhead of the retrieval process is saved, meanwhile, the semantic completeness of the target text block is reserved, and the retrieval quality is improved.
In an alternative implementation, the association features include a repetition degree, which indicates the degree to which two text blocks having an association relationship repeat each other in expressing the topic of the webpage text. In this embodiment, the electronic device screens out a plurality of target text blocks from the plurality of text blocks according to the repetition degree, which reduces the retrieval index redundancy caused by at least two text blocks expressing one topic and thus reduces the number of retrieval indexes.
In an alternative implementation, the semantic features include an importance, which indicates how important a text block is in expressing the topic of the webpage text. Screening out a plurality of target text blocks from the plurality of text blocks according to the feature indexes may include: the electronic device screens out the plurality of target text blocks from the plurality of text blocks according to the repetition degree and the importance. This reduces the retrieval index redundancy caused by at least two text blocks expressing one topic, and thus the number of retrieval indexes, without losing important information in each webpage text.
In an alternative implementation, the degree of repetition relates to a second semantic similarity between two text blocks having an association relationship, and mutual information between the two text blocks, the mutual information indicating a degree of interdependency between the two text blocks. In this embodiment, the higher the semantic similarity between two text blocks, the more relevant the semantic level between the two text blocks. The higher the mutual information between two text blocks, the higher the degree of dependency of the two text blocks on each other, and the lower the mutual information, the higher the degree of independence of the two text blocks from each other. The repetition degree is measured from two levels through the semantic similarity and mutual information between two text blocks, wherein the semantic similarity relates to the semantic level, the mutual information relates to the association degree level of the two text blocks, and the mutual information can find the common occurrence condition of words from the statistical angle, so that whether semantic correlation or topic correlation exists between the two text blocks is analyzed, the accuracy of measuring the repetition degree of the two text blocks is improved, and the quality of constructing a retrieval index is improved.
In an optional implementation manner, the semantic features further include an existence probability, which indicates the degree to which the semantics expressed by the text block are repeated in the webpage text to which it belongs: the higher the existence probability, the lower the degree of repetition. The method further includes: acquiring hot sentences, where a hot sentence is a retrieval sentence whose retrieval frequency is higher than a preset value. The importance of a text block is related to the retrieval frequency of the hot sentences, the third semantic similarity between the text block and the hot sentences, and the existence probability of the text block. In this embodiment, a hot sentence is a historical retrieval sentence frequently used by users. Users are more likely to click a webpage found through such a sentence, which suggests that the historical retrieval sentence may itself be a keyword of the webpage text or a sentence carrying the keyword. If a text block is highly similar to a hot sentence, it is also highly similar to a key sentence (keyword) in the webpage text; and the higher the existence probability of the text block (that is, the less its semantics are repeated in the webpage text), the more important the text block is for expressing the topic of the webpage text. Determining the importance of a text block from several angles makes the determination more effective.
In a third aspect, an embodiment of the present application provides a webpage retrieval apparatus, including: a processing module, configured to segment a plurality of webpage texts sentence by sentence to obtain a plurality of text blocks, and further configured to screen out a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks, where the feature indexes indicate association features among the text blocks, or the association features together with the respective semantic features of the text blocks in the webpage text, and the association features indicate the association between the semantics of at least two text blocks; and a storage module, configured to store the target text blocks, which are used for retrieving the webpage texts.
In an alternative implementation, the receiving module is configured to receive a query statement; and the processing module is further used for matching the first semantic vector with a plurality of second semantic vectors so as to output a target webpage text matched with the query statement, wherein the first semantic vector is the semantic vector of the query statement, and the plurality of second semantic vectors are semantic vectors corresponding to a plurality of target text blocks screened from the plurality of webpage texts.
In an optional implementation manner, the processing module is further specifically configured to: determining a plurality of first semantic similarities, each first semantic similarity indicating a similarity of a first semantic vector of the query statement and one of a plurality of second semantic vectors, the plurality of first semantic similarities corresponding to different second semantic vectors; outputting a target webpage text according to the at least one ranking order and the at least one relative position; wherein each of the ranking ranks is used to indicate a ranking of a corresponding first semantic similarity of one of the target text blocks, and each of the relative positions is used to indicate a distance of positions of at least two of the target text blocks in one web page text.
In an optional implementation manner, the processing module is further specifically configured to: output the target webpage text based on a plurality of fourth semantic similarities, each fourth semantic similarity indicating the similarity between the first semantic vector and one of a plurality of third semantic vectors, where each third semantic vector is the semantic vector of a corresponding one of the plurality of webpage texts and is related to the semantic vectors of the plurality of target text blocks in that webpage text, the positions of those target text blocks in that webpage text, and the ranking orders of those target text blocks.
In an alternative implementation, the association features include a repetition degree, which indicates the degree to which two text blocks having an association relationship repeat each other in expressing the topic of the webpage text; and the processing module is further configured to screen out a plurality of target text blocks from the plurality of text blocks according to the repetition degree.
In an alternative implementation, the semantic features include importance, which is used to indicate the importance of the text block in expressing the topic of the web page text; and the processing module is also used for screening out a plurality of target text blocks from the plurality of text blocks according to the repetition degree and the importance degree.
In an alternative implementation, the repetition degree is related to a second semantic similarity and mutual information between the two text blocks, the second semantic similarity indicates a similarity degree of semantics of the two text blocks having an association relationship, and the mutual information indicates a degree of interdependency between the two text blocks.
In an optional implementation manner, the semantic features further include an existence probability, where the existence probability is used to indicate a repetition degree of the text block expressing semantics in the webpage text to which the text block belongs; the processing module is further used for acquiring a hot statement, wherein the hot statement is a retrieval statement with retrieval frequency higher than a preset value; the importance of the text block is related to the retrieval frequency of the hot spot sentences, the third semantic similarity of the text block and the hot spot sentences and the existence probability of the text block.
In an optional implementation manner, the association relationship includes at least one of the following: the two text blocks share the same anchor text or entity, the two text blocks have overlapping text information, or the two text blocks have a context structure.
In a fourth aspect, an embodiment of the present application provides a web page retrieval apparatus, including: a receiving module, configured to receive a query statement; the processing module is used for acquiring a plurality of semantic similarities, each semantic similarity indicates the similarity of a first semantic vector and one second semantic vector in a plurality of second semantic vectors, the plurality of first semantic similarities correspond to different second semantic vectors, the first semantic vector is a semantic vector of a query statement, the plurality of second semantic vectors are semantic vectors corresponding to a plurality of target text blocks segmented by a plurality of webpage texts, and each target text block indicates at least one sentence in the webpage text; the processing module is also used for outputting at least one target webpage text according to at least one sorting order and at least one relative position; wherein each ranking order is used to indicate a ranking of semantic similarity corresponding to one of the plurality of target text blocks, and each relative position is used to indicate a distance of positions of at least two of the plurality of target text blocks in one web page text.
In an optional implementation manner, the processing module is further specifically configured to: outputting the target webpage text based on a plurality of fourth semantic similarities, each of the fourth semantic similarities indicating a similarity of one of the first semantic vector and a plurality of third semantic vectors, wherein each of the third semantic vectors is a corresponding semantic vector of the plurality of webpage texts, each of the third semantic vectors is a semantic vector corresponding to a plurality of target text blocks in a corresponding webpage text, positions of the plurality of target text blocks in the corresponding webpage text, and rank of the plurality of target text blocks in the corresponding webpage text.
In an optional implementation manner, the processing module is further specifically configured to: segment a plurality of web page texts in units of sentences to obtain a plurality of text blocks; and screen a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks, where the feature indexes indicate either association features between the text blocks, or both the association features and semantic features of the text blocks, and the association features indicate an association between the semantics of at least two text blocks.
In a fifth aspect, embodiments of the present application provide an electronic device, including a processor coupled with at least one memory, the processor being configured to read a computer program stored in the at least one memory, so as to cause the electronic device to perform the method according to the first aspect, or so as to cause the electronic device to perform the method according to the second aspect.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium for storing a computer program or instructions, which when executed, cause a computer to perform the method according to the first aspect or cause an electronic device to perform the method according to the second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product including instructions, which, when executed by a computer, cause the computer to implement the method of the above first aspect, or cause an electronic device to perform the method of the above second aspect.
In an eighth aspect, embodiments of the present application provide a chip, where the chip includes a processor and a communication interface, and the communication interface is, for example, an input/output interface, a pin, a circuit, or the like. The processor is configured to read instructions to perform the method performed by the electronic device of the first or second aspect.
Drawings
FIGS. 1a and 1b are schematic diagrams of two application scenarios of a retrieval system in an embodiment of the present application;
FIG. 2 is a flowchart illustrating steps of an embodiment of a method for retrieving a web page according to the present invention;
FIG. 3a is a schematic diagram of a first topology in an embodiment of the present application;
FIG. 3b is a schematic diagram of a second topology in an embodiment of the present application;
FIG. 3c is a schematic diagram illustrating a target node screened from a plurality of nodes in an embodiment of the present application;
FIG. 4a is a schematic diagram illustrating positions of at least two target text blocks in a webpage text to which the target text blocks belong according to an embodiment of the present application;
FIG. 4b is a schematic diagram of an application scenario for generating semantic vectors of web page texts in the embodiment of the present application;
FIG. 5 is a schematic diagram illustrating semantic similarity between semantic vectors of web page texts and semantic vectors of query semantics according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating steps of another embodiment of a method for retrieving a web page according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a web page retrieval device according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of another web page retrieval apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a terminal in an embodiment of the present application.
Detailed Description
Some optional features in the embodiments of the present application may, in some scenarios, be implemented independently of other features (for example, independently of the scheme on which they are currently based) to solve corresponding technical problems and achieve corresponding effects, or may, in other scenarios, be combined with other features as required. Accordingly, the apparatuses provided in the embodiments of the present application may also implement these features or functions, which are not described herein again.
In the description of the present application, "plurality" means two or more unless otherwise specified. "At least one of the following" or similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
In addition, to describe the technical solutions of the embodiments of the present application clearly, terms such as "first" and "second" are used in the embodiments to distinguish between identical or similar items having substantially the same functions and effects. Those skilled in the art will appreciate that the terms "first", "second", and the like do not denote any order, quantity, or importance. Also, in the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, illustration, or description. Any embodiment or design described herein as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "for example" is intended to present relevant concepts in a concrete fashion for ease of understanding.
To better understand the present application, the terms used in the present application are first explained.
Semantic matching: matching is performed based on the semantic similarity of texts. For example, when searching "how to learn python", content that expresses "how to learn python" can be retrieved through the semantic representation of "how to learn".
Deep semantic search: the search engine is no longer limited to retrieving relevant documents by character matching. Instead, it computes the semantic representation implied by a text through a deep semantic model, and searches the semantic space using semantic vectors to find web page texts that meet the user's search requirements.
Deep representation model: a representation model based on deep learning, whose main function is to obtain a vector representation (or simply "vector") of a text.
Language model: also called a statistical language model. It estimates the probability that different words occur consecutively in human language, so that a machine can judge whether a piece of input text conforms to human language habits, or which words frequently occur together.
Theme: the theme of a text is its central idea, which generally refers to the main content of the text. Generally, a text may have a general theme, under which a plurality of sub-themes (topics) may be expressed. The theme described in the embodiments of the present application may be the general theme of a web page text or a sub-theme of the web page text, which is not specifically limited. For example, the general theme of a web page text is "three-dimensional reconstruction technique", where the first sub-theme is "SFM technique", the second sub-theme is "application of three-dimensional reconstruction technique", and so on.
Sentence: a continuous string of characters composed of words and phrases. There is a relatively long pause between sentences, and a sentence ends with a punctuation mark, for example a period, question mark, ellipsis, or exclamation mark. Optionally, in this embodiment, there may be a shorter pause between sentences, and the punctuation mark at the end of a sentence may also be a comma, a semicolon, or the like.
The embodiment of the application provides a web page retrieval method, which is applied to a retrieval system. Referring to fig. 1a, fig. 1a shows a first scenario of the retrieval system. The retrieval system includes a web page database 101, a web page processing device 102, a web page matching device 103, and a terminal 104. The web page database 101 is used for storing a large number of web page texts (also referred to as "web pages" or "web page documents"). The web page processing device 102 is used for segmenting a plurality of web page texts in the web page database 101 in units of sentences to obtain a plurality of text blocks, where each web page text corresponds to a plurality of text blocks, and for screening a plurality of target text blocks out of the plurality of text blocks according to feature indexes of the text blocks. The terminal 104 is configured to receive a query statement (query) input by a user and send the query statement to the web page matching device 103. The web page matching device 103 is configured to receive the query statement from the terminal 104, obtain a plurality of second semantic vectors from the web page processing device 102, match the first semantic vector of the query statement against the plurality of second semantic vectors, and output a target web page text matching the query statement, where the plurality of second semantic vectors are semantic vectors corresponding to the plurality of target text blocks screened from the plurality of web page texts. The web page database 101, the web page processing device 102, and the web page matching device 103 may all be servers; the process executed by the web page processing device 102 may be an offline process, and the steps executed by the web page matching device 103 may be an online process. Optionally, referring to fig. 1b, fig. 1b shows a second scenario of the retrieval system.
The web page database, the web page processing device, and the web page matching device may also be integrated in an electronic device 105. The electronic device 105 is configured to perform the functions of the web page database, the web page processing device, and the web page matching device, which serve as functional modules in the electronic device. The electronic device may be, for example, a server.
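The matching performed by the web page matching device can be illustrated with a minimal sketch. It assumes that the semantic vectors are plain Python lists and that matching uses cosine similarity; the names `cosine`, `rank_blocks`, and `block_index`, and all vector values, are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_blocks(query_vec, block_index):
    """Return (page_id, block_id, similarity) tuples sorted by descending similarity."""
    scored = [(page_id, block_id, cosine(query_vec, vec))
              for (page_id, block_id), vec in block_index.items()]
    return sorted(scored, key=lambda item: item[2], reverse=True)

# Hypothetical offline index: (web page id, target block id) -> second semantic vector.
block_index = {
    ("page1", "A"): [1.0, 0.0, 0.0],
    ("page1", "B"): [0.6, 0.8, 0.0],
    ("page2", "A"): [0.0, 0.0, 1.0],
}
query_vec = [1.0, 0.0, 0.0]  # hypothetical first semantic vector of the query statement
ranking = rank_blocks(query_vec, block_index)
```

In this sketch the index would be built offline and only `rank_blocks` would run online, which mirrors the offline/online split described above.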
In the embodiment of the application, the electronic device (or the webpage processing device) firstly performs variable length segmentation on each webpage text by taking sentences as units, so that the completeness of the expression semantics of the text blocks is ensured, the electronic device dynamically screens out the target text blocks from the plurality of text blocks according to the characteristic indexes of the text blocks, and the target text blocks are used as retrieval indexes, so that the number of the retrieval indexes is greatly reduced, the resource overhead in the retrieval process is saved, meanwhile, the semantic completeness of the target text blocks is reserved, and the quality of retrieving the target webpage text is improved.
Referring to fig. 2, an embodiment of the present application provides a web page retrieval method, which mainly includes two processes. 1. The offline process, that is, the retrieval index construction stage, as shown in the following steps 201 to 203; the execution subject of steps 201 to 203 may be the web page processing device in the first scenario, or the electronic device in the second scenario. 2. The online process, that is, the web page matching stage, as shown in steps 204 to 205; the execution subject of steps 204 to 205 may be the web page matching device in the first scenario, or the electronic device in the second scenario. In the embodiment of the present application, the electronic device is taken as an example of the execution subject of the web page retrieval method.
Step 201, the electronic device segments a plurality of web page texts in units of sentences to obtain a plurality of text blocks. Each web page text corresponds to a plurality of text blocks.
Firstly, a webpage text is segmented by taking sentences as units to obtain a plurality of sentences.
Then, the sentences are combined to obtain a plurality of text blocks. Each text block contains at least one sentence, which may also be understood as a collection of sentences. Illustratively, the text block may be represented as the following formula (1).
$$c_i = (s_1, s_2, \ldots, s_N) \tag{1}$$
Wherein $c_i$ represents the i-th text block; $s_N$ represents the N-th sentence, where N is a positive integer that can be set according to the practical application. The positions in the web page text of two adjacent sentences in the above formula (1) are not limited; for example, $s_1$ and $s_2$ may be two adjacent sentences or two non-adjacent sentences in the web page text.
Alternatively, in order to avoid that one text block represents a plurality of topics (topic), resulting in a reduction in the accuracy of the search, the maximum length of the text block may be defined so that each text block may express one topic. The length of the text block is less than or equal to a first threshold. Specifically, in practical applications, the first threshold may be set according to an empirical value, that is, the total character length of the N sentences does not exceed the limit of the first threshold.
Optionally, so that the target text blocks screened in the subsequent step are associated with one another, two adjacent text blocks among the constructed text blocks may share overlapping sentences. For example, the i-th text block $c_i$ and the (i+1)-th text block $c_{i+1}$ have at least one overlapping sentence; that is, text block $c_i$ includes sentence a, and text block $c_{i+1}$ may also contain sentence a.
In this embodiment, the web page text is segmented in units of sentences. Compared with the prior art, in which the web page text is segmented into fixed-length character runs (for example, 10 characters), this avoids problems such as the loss of key information caused by splitting a keyword. A plurality of sentences form a text block, and each text block can express a complete meaning, so that the search quality is improved.
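Step 201 can be sketched as follows. This is a minimal illustration, assuming that sentences end at a period, question mark, or exclamation mark, that consecutive sentences are packed greedily up to a character limit (the first threshold), and that adjacent blocks share one sentence. The function names, the regular expression, and the sample text are hypothetical; the embodiment does not prescribe a particular packing strategy.

```python
import re

# A run of non-terminator characters followed by any sentence-ending punctuation.
SENTENCE_PATTERN = re.compile(r"[^.?!。？！]+[.?!。？！]*")

def split_sentences(text):
    """Split text into sentences, keeping the end punctuation with each sentence."""
    return [m.strip() for m in SENTENCE_PATTERN.findall(text) if m.strip()]

def build_blocks(sentences, max_chars, overlap=1):
    """Pack consecutive sentences into blocks of at most max_chars characters.

    Adjacent blocks share up to `overlap` sentences. A single sentence longer
    than max_chars still forms its own block.
    """
    blocks = []
    start = 0
    while start < len(sentences):
        end = start + 1  # a block always holds at least one sentence
        length = len(sentences[start])
        while end < len(sentences) and length + len(sentences[end]) <= max_chars:
            length += len(sentences[end])
            end += 1
        blocks.append(sentences[start:end])
        if end == len(sentences):
            break
        # Repeat the last `overlap` sentences in the next block, but always
        # advance by at least one sentence so the loop terminates.
        start = max(end - overlap, start + 1)
    return blocks

sentences = split_sentences("Aaaa. Bbbb? Cccc! Dddd.")
blocks = build_blocks(sentences, max_chars=10, overlap=1)
```

Because block boundaries always fall on sentence boundaries, a keyword is never split across two blocks, unlike with fixed-length character segmentation.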
Step 202, the electronic device screens out a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks. The feature indexes indicate either association features between the text blocks, or both the association features and semantic features of the text blocks in the web page text, where the association features indicate an association between the semantics of at least two text blocks.
Illustratively, the semantic features include importance, which indicates how important a text block is for expressing the theme of the web page text to which it belongs. The association features include a repetition degree, which indicates the degree to which two associated text blocks repeat each other in expressing the theme of the web page text to which they belong.
Optionally, in order to more intuitively represent the text blocks and the association relationship between the text blocks, a topological structure of the text blocks may be established first, and the text blocks and the association relationship between the text blocks may be represented in a topological structure form. For example, the topology may be a graph structure, or the topology may be an adjacency matrix, which is not limited in particular. In this embodiment, the topology may be exemplarily illustrated by taking a graph structure as an example. One web page text corresponds to one topology. In order to distinguish between the topology before the screening and the topology after the screening, the structure before the screening is referred to as a "first topology", the structure after the screening is referred to as a "second topology", and the text block corresponding to the second topology is referred to as a "target text block".
First, please refer to fig. 3a, which is a schematic diagram of a first topology. The electronic device constructs the first topology according to the text blocks and the association relationships among the text blocks. The first topology includes a plurality of nodes and a plurality of edges, where the nodes indicate the text blocks and an edge between two nodes indicates an association relationship between the two corresponding text blocks. Illustratively, the first topology includes node A (e.g., corresponding to text block A), node B (e.g., corresponding to text block B), node C (e.g., corresponding to text block C), …, and node G (e.g., corresponding to text block G). Node A is connected to node B, that is, text block A and text block B have an association relationship; node B is connected to node C, that is, text block B and text block C have an association relationship; and so on. Optionally, the association relationship may include at least one of the following three cases. 1. The two text blocks contain the same anchor text or entity information. Anchor text, also called an anchor text link, is a form of link; it establishes a link relationship between a keyword and a Uniform Resource Locator (URL). 2. The two text blocks have overlapping text information. For example, the same sentence a, or the same keyword b, appears in both text blocks. 3. The two text blocks have a context structure. For example, text block B contains sentence a and text block C contains sentence c, where sentence a expresses the preceding context and sentence c expresses the following context.
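A first topology of this kind can be sketched as an adjacency structure. The sketch below covers only case 2 (overlapping sentences) and assumes each text block is a list of sentence strings; `build_first_topology` and the sample blocks are hypothetical names and data used for illustration.

```python
from itertools import combinations

def build_first_topology(blocks):
    """Return an adjacency dict: node index -> set of associated node indices.

    Two nodes are connected when their text blocks share at least one sentence.
    """
    graph = {i: set() for i in range(len(blocks))}
    for i, j in combinations(range(len(blocks)), 2):
        if set(blocks[i]) & set(blocks[j]):  # a shared sentence links the two blocks
            graph[i].add(j)
            graph[j].add(i)
    return graph

# Hypothetical text blocks; "s1".."s5" stand for sentences.
blocks = [["s1", "s2"], ["s2", "s3"], ["s3", "s4"], ["s5"]]
graph = build_first_topology(blocks)
```

The same adjacency information could equally be stored as an adjacency matrix, which the description above names as an alternative form of the topology.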
Next, please refer to fig. 3b, which is a schematic diagram of a second topology. The electronic device screens out a plurality of target nodes from the plurality of nodes according to the importance and the repetition degree to obtain the second topology. The second topology indicates a plurality of target text blocks and the association relationships among them, and the target nodes correspond to the target text blocks. As shown in fig. 3b, for web page text 1, node A, node B, node C, …, and node G are screened to obtain the second topology, in which node A, node B, and node E are target nodes. Optionally, text block B corresponding to node B and text block E corresponding to node E may have overlapping sentences, in which case node B and node E are connected in the second topology.
The following description specifically exemplifies that the electronic device screens out a plurality of target nodes from a plurality of nodes according to importance and repetition, as described in the following first implementation manner and second implementation manner.
In a first implementation, the electronic device may screen out a plurality of target text blocks from the plurality of text blocks according to the repetition degree. In this embodiment, the repetition degree of two text blocks is measured jointly by their semantic similarity and their mutual information. The higher the semantic similarity between two text blocks, the more closely they are related at the semantic level. The higher the mutual information between two text blocks, the more the two text blocks depend on each other; the lower the mutual information, the more independent they are. The two measures address different levels: semantic similarity concerns the semantic level, while mutual information concerns the degree of association between the two text blocks and uses word co-occurrence statistics to analyze whether the two text blocks are semantically or topically related. Measuring the repetition degree from both levels improves the accuracy of the measurement and therefore the quality of the constructed retrieval index.
For one webpage text (such as webpage text 1), the plurality of text blocks in the webpage text 1 include a first text block and a second text block, wherein the first text block and the second text block are any two text blocks in the plurality of text blocks having an association relationship. As in the following expression (2), the first text block is the ith text block, and the second text block is the jth text block.
1) Semantic similarity, which indicates the correlation between two text blocks at the semantic level. The higher the semantic similarity between two text blocks, the more closely they are related at the semantic level. The semantic similarity between two text blocks (also referred to as "second semantic similarity") is shown in the following formula (2).
$$\mathrm{sim}(c_i, c_j) = \mathrm{cosine}(C_i, C_j) \tag{2}$$
Wherein $c_i$ represents the i-th text block and $c_j$ represents the j-th text block; $C_i$ is the vector representation of $c_i$, and $C_j$ is the vector representation of $c_j$. $C_i$ and $C_j$ are obtained by a deep representation model. In the above formula (2), $\mathrm{cosine}(C_i, C_j)$ denotes the cosine of the angle between the two vectors $C_i$ and $C_j$.
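For reference, the cosine in formula (2) can be written out using the standard definition of cosine similarity. The dot product and Euclidean norm notation below is not part of the original text.

```latex
\mathrm{cosine}(C_i, C_j) = \frac{C_i \cdot C_j}{\lVert C_i \rVert \, \lVert C_j \rVert}
```

The value lies between -1 and 1, and a value closer to 1 means the two vector representations point in more similar directions.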
2) Mutual information, which indicates the degree of interdependence between two text blocks. The higher the mutual information between two text blocks, the more the two text blocks depend on each other; the lower the mutual information, the more independent they are. The mutual information of two text blocks is shown in the following formula (3).
[Formula (3): shown as image BDA0003270724390000101 in the original publication]
Wherein K represents the number of text blocks; $c_i$ represents the i-th text block and $c_j$ represents the j-th text block; $\{c_1, \ldots, c_K\}$ is the set of K text blocks obtained by segmenting a web page text (such as web page text 1), with i and j each taking an integer value from 1 to K; for the text blocks in the set, $P(c_i)$ represents the marginal probability density function of text block $c_i$, $P(c_j)$ represents the marginal probability density function of text block $c_j$, and $P(c_i, c_j)$ represents the joint probability density function of text block $c_i$ and text block $c_j$.
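For reference, the standard textbook definition of mutual information for two discrete random variables X and Y is given below. The symbols X, Y, x, and y are not from the original text; formula (3) presumably applies a definition of this kind to the text-block probabilities $P(c_i)$, $P(c_j)$, and $P(c_i, c_j)$ described above, but its exact form is not reproduced here.

```latex
I(X; Y) = \sum_{x} \sum_{y} P(x, y) \, \log \frac{P(x, y)}{P(x)\, P(y)}
```

Under this definition the mutual information is zero when X and Y are independent and grows as they become more dependent, which is consistent with the interpretation given above.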
The following illustrates how the electronic device screens a plurality of target text blocks from a plurality of text blocks according to the repetition degree, as shown in the following formula (4).
[Formula (4): shown as image BDA0003270724390000103 in the original publication]
Where S represents the set of selected text blocks; $\lambda_2$ and $\lambda_3$ are weight coefficients; R represents the set of all text blocks obtained by segmenting the web page text (such as web page text 1), $R = \{c_1, \ldots, c_K\}$, where K represents the number of all text blocks; $c_i \in R \setminus S$ indicates that $c_i$ belongs to set R but not to set S; $\mathrm{sim}(c_i, c_j)$ represents the semantic similarity of the i-th text block $c_i$ and the j-th text block $c_j$; and $I(c_i, c_j)$ represents the mutual information of the i-th text block $c_i$ and the j-th text block $c_j$. Total is a constant that represents the total score (e.g., 100). argmax selects the text block for which the score is maximal.
Illustratively, please refer to FIG. 3a for understanding. The process of dynamically screening the target text blocks is described below with reference to the above formula (4). It can be understood that S represents the text blocks that are temporarily retained during the dynamic screening process. The dynamic screening process is exemplified by the following 3 calculations.
1st calculation: take text block A as the initially selected text block, so that S contains only text block A. Substituting into the above formula (4), with text block A as $c_j$ and text block B as $c_i$, yields one score. Because text block A has an association relationship only with text block B, this score is the maximum output score; text block B is therefore selected and placed into S, and S now contains text block A and text block B.
2nd calculation: since text block B has association relationships with text block A, text block C, and text block D, it is next necessary to select which text block (or text blocks) among text block A, text block C, and text block D is placed into S. Taking text block B as $c_j$ and text block A, text block C, and text block D respectively as $c_i$, substituting into the above formula (4) yields 3 scores. For example, see Table 1 below for the 3 scores.
TABLE 1
c_j             c_i             Score
Text block B    Text block A    Aa
Text block B    Text block C    Bb
Text block B    Text block D    Cc
The text block corresponding to the maximum score among the 3 scores shown in Table 1 is selected. Depending on which score or scores are maximal, there are the following 6 cases.
Case 1: if the score Aa is the maximum value, the text blocks included in S are temporarily unchanged, and S includes text block A and text block B.
Case 2: if the score Bb is the maximum value, text block C is selected and placed into S, and text block A is deleted from S; S includes text block B and text block C.
Case 3: if the score Cc is the maximum value, text block D is selected and placed into S, and text block A is deleted from S; S includes text block B and text block D.
Case 4: if the scores Aa and Bb are both maximum values, text block C is selected and placed into S; S includes text block A, text block B, and text block C.
Case 5: if the scores Aa and Cc are both maximum values, text block D is selected and placed into S; S includes text block A, text block B, and text block D.
Case 6: if the scores Aa, Bb, and Cc are all maximum values, text block C and text block D are selected and placed into S; S includes text block A, text block B, text block C, and text block D.
As can be seen from the above cases 2 and 3, the text blocks in the set S change dynamically: a previously selected text block (such as text block A) may be deleted from S as a result of a later calculation.
3rd calculation: text block D is taken as $c_j$, and text block B and text block E are respectively taken as $c_i$; 2 scores are calculated according to formula (4), and the text block corresponding to the maximum of the 2 scores is selected. The principle of the 3rd calculation is the same as that of the 2nd calculation and is not described in detail again; please refer to the 2nd calculation for understanding. After the last calculation, the text blocks included in the set S (such as text block A, text block B, and text block E) are the target text blocks.
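A single calculation of the kind walked through above can be sketched as follows. Formula (4) is not reproduced in this text, so the score used here, a constant total minus the weighted semantic similarity and mutual information, is an assumption made only for illustration, as are the names `pair_score` and `best_neighbors` and all numeric values. The sketch returns the candidates with the maximal score and does not model the deletion of earlier blocks from S.

```python
def pair_score(sim, mi, total=100.0, lam2=1.0, lam3=1.0):
    """Assumed score form: constant total minus weighted repetition terms."""
    return total - lam2 * sim - lam3 * mi

def best_neighbors(anchor, neighbors, sim, mi, **weights):
    """Score every neighbor c_i of the anchor c_j and return those with the maximal score.

    Ties are kept, so several neighbors can be returned (compare cases 4 to 6).
    """
    scores = {n: pair_score(sim[(anchor, n)], mi[(anchor, n)], **weights)
              for n in neighbors}
    best = max(scores.values())
    return sorted(n for n, s in scores.items() if s == best), scores

# Hypothetical values for the 2nd calculation: anchor B, candidates A, C, and D.
sim = {("B", "A"): 0.9, ("B", "C"): 0.2, ("B", "D"): 0.5}
mi = {("B", "A"): 0.8, ("B", "C"): 0.1, ("B", "D"): 0.3}
selected, scores = best_neighbors("B", ["A", "C", "D"], sim, mi)
```

With these hypothetical values text block C has the lowest repetition with text block B and therefore the highest assumed score, which corresponds to case 2 above.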
In a second implementation manner, the electronic device may screen out a plurality of target text blocks from the plurality of text blocks according to two indexes, namely, the importance degree and the repetition degree. The second implementation manner is different from the first implementation manner in that an index of "importance" is added in the process of screening the target text block, that is, the target text block is not screened only by the index of "repetition". For example, if the text block a and the text block B have a high repetition degree, and the importance of the text block a is high and the importance of the text block B is low, it is possible to select the text block a as the target text block in the case where one of the two text blocks is selected as the target text block. In the implementation mode, the redundancy of the retrieval indexes caused by the fact that at least two text blocks express one theme can be reduced on the basis of not losing important information in each webpage text, and the number of the retrieval indexes is reduced.
1. The following illustrates how the electronic device determines the importance of a text block. The importance of a text block is related to the retrieval frequency of a hot statement, the semantic similarity between the text block and the hot statement (also referred to as "third semantic similarity"), and the existence probability of the text block.
First, the electronic device obtains hot statements from a hot statement library or a log file. A hot statement is a retrieval statement frequently used by users, that is, a retrieval statement whose retrieval frequency is higher than a preset value. The higher the query frequency of a retrieval statement, the higher its popularity, and the more likely it is to be used by users.
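Extracting hot statements from a log can be sketched as follows. The sketch assumes the log is simply a list of query strings and that queries are compared after trimming whitespace and lowercasing; the function name `hot_statements`, the normalization, and the sample log are hypothetical.

```python
from collections import Counter

def hot_statements(query_log, min_frequency):
    """Return {query: count} for queries whose count is higher than min_frequency."""
    # Normalization (strip + lowercase) is an assumption, not part of the source text.
    counts = Counter(q.strip().lower() for q in query_log)
    return {q: n for q, n in counts.items() if n > min_frequency}

# Hypothetical query log.
log = ["how to learn python"] * 3 + ["3D reconstruction", "3d reconstruction "] + ["SFM"]
hot = hot_statements(log, min_frequency=1)
```

The returned counts can then serve as the retrieval frequencies used when computing the importance of a text block.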
The electronic device then calculates a third semantic similarity of the text block and the hotspot statement. The third semantic similarity is shown in the following formula (5).
$$\mathrm{sim}(c_i, q_p) = \mathrm{cosine}(C_i, Q_p) \tag{5}$$
Wherein $c_i$ represents the i-th text block, $q_p$ represents the p-th hot statement, $C_i$ is the vector representation of the i-th text block, and $Q_p$ is the vector representation of the p-th hot statement; $C_i$ and $Q_p$ are obtained by a deep representation model; $\mathrm{cosine}(C_i, Q_p)$ denotes the cosine of the angle between the two vectors $C_i$ and $Q_p$.
Finally, the electronic device calculates the importance of the text block according to the retrieval frequency of the hot statements, the third semantic similarity, and the existence probability. The semantic features further include the existence probability, which indicates the semantic repetition degree of a text block in the web page text. For example, text block F is composed of the last sentence of the previous natural paragraph and the first sentence of the next natural paragraph in the web page text. In general, such a text block cannot express a complete semantic topic and contributes little to expressing the theme of the whole web page text. Text block F may also repeat the semantics expressed by text block G in the previous natural paragraph and text block E in the next natural paragraph, so deleting text block F does not affect the semantic expression of the whole web page text.
A language model can judge which words (or sentences) often appear together and whether a text conforms to human language habits. The semantics of a text block formed from the last sentence of one paragraph and the first sentence of the next are typically discontinuous, and the words in the two sentences do not often appear together. Therefore, if the language model determines that the existence probability of a text block is lower than a threshold, the text block may have been formed from sentences of two adjacent paragraphs; conversely, if the existence probability of a text block is high, the semantics of its sentences are continuous and the sentences probably come from the same paragraph. The higher the existence probability of a text block, the lower its semantic repetition degree in the web page text. Calculating the existence probability thus reduces the cases in which sentences from adjacent paragraphs form one text block. The existence probability of a text block is calculated as in the following formula (6).
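\[
Z_i = f(C_i), \qquad
\mathrm{Prob}(c_i) =
\begin{cases}
\lVert Z_i \rVert, & \lVert Z_i \rVert \ge \delta \\
0, & \lVert Z_i \rVert < \delta
\end{cases}
\tag{6}
\]
Here f denotes the nonlinear activation function.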
Wherein Prob(c_i) denotes the existence probability of the i-th text block c_i; C_i is the vector representation of the i-th text block c_i; Z_i is the new vector representation of the i-th text block c_i obtained through a nonlinear activation function; and ||·|| denotes the norm (modulus) of a vector. Equation (6) uses threshold truncation to obtain the final existence probability of the i-th text block: if ||Z_i|| is greater than or equal to δ (also referred to as the "second threshold"), the existence probability Prob(c_i) of text block c_i equals ||Z_i||; if ||Z_i|| is less than δ, the existence probability Prob(c_i) of text block c_i equals 0.
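The threshold truncation described for equation (6) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `existence_probability` is hypothetical, and the activated vector Z_i is taken as an input because the text does not state which nonlinear activation function produces it.

```python
import math
from typing import Sequence


def existence_probability(z: Sequence[float], delta: float) -> float:
    """Threshold-truncated existence probability of one text block.

    z     -- the activated vector Z_i of the text block (assumed given)
    delta -- the second threshold
    """
    norm = math.sqrt(sum(x * x for x in z))  # ||Z_i||
    # Keep the norm when it reaches the threshold, otherwise truncate to 0.
    return norm if norm >= delta else 0.0
```

A block whose value is truncated to 0 corresponds to a node such as node F, which can be filtered out before the importance is calculated.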
The importance is shown by the following formula (7).
Figure BDA0003270724390000133
Wherein q_p denotes the p-th hotspot sentence;
Figure BDA0003270724390000134
denotes the query frequency of the hotspot sentence q_p; Sim(c_i, q_p) denotes the third semantic similarity between the i-th text block c_i and the p-th hotspot sentence q_p; and Prob(c_i) denotes the existence probability of the i-th text block c_i.
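The way the three quantities could be combined is sketched below. The exact form of equation (7) is not recoverable from the text, so the combination used here (existence probability multiplied by a frequency-weighted sum of similarities over the hotspot sentences) is purely an assumption for illustration, and the name `importance` is hypothetical.

```python
from typing import Sequence


def importance(freqs: Sequence[float], sims: Sequence[float], prob: float) -> float:
    """Illustrative importance of one text block.

    freqs -- query frequency of each hotspot sentence q_p
    sims  -- third semantic similarity Sim(c_i, q_p) for the same sentences
    prob  -- existence probability Prob(c_i) of the text block
    ASSUMPTION: the three factors are combined multiplicatively.
    """
    if len(freqs) != len(sims):
        raise ValueError("freqs and sims must have the same length")
    return prob * sum(f * s for f, s in zip(freqs, sims))
```

Under this assumed form, a block whose existence probability was truncated to 0 has importance 0 regardless of how similar it is to the hotspot sentences.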
In this embodiment, a hotspot sentence is a historical search sentence that users use frequently and that readily leads users to click a web page, which indicates that the historical search sentence may itself be a keyword of the web page text or a sentence carrying such a keyword. If a text block is highly similar to a hotspot sentence, it is highly similar to the keywords of the web page text. If, in addition, the existence probability of the text block is high (that is, its semantic repetition degree in the web page text is low), the text block is more important for expressing the topic of the web page text.
2. As a specific example, the electronic device filters the text blocks according to both the importance and the repetition degree of the text blocks. The plurality of text blocks are filtered according to the following equation (8).
Figure BDA0003270724390000135
Equation (8) differs from equation (4) above in that it incorporates the importance of text block c_i: the corresponding term of equation (4) is replaced with the importance of the text block. On the basis of the importance of each text block, equation (8) combines the repetition degree between a selected text block c_j and another text block c_i to filter the plurality of text blocks. In equation (8), S denotes the set of selected text blocks; λ1, λ2, and λ3 are weight coefficients; R denotes the set of all text blocks obtained by segmenting a web page text (e.g., web page text 1),
Figure BDA0003270724390000136
k denotes the number of all text blocks; c_i ∈ R\S indicates that c_i belongs to set R but not to set S; Sim(c_i, c_j) denotes the semantic similarity between the i-th text block c_i and the j-th text block c_j; I(c_i, c_j) denotes the mutual information between the i-th text block c_i and the j-th text block c_j; and argmax returns the argument (here, the text block) that maximizes the expression.
Please refer to the above exemplary description of the process for filtering a plurality of text blocks according to equation (4), which is not repeated herein.
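A greedy selection in the spirit of equation (8) can be sketched as follows. The scoring form (importance weighted by λ1, minus the largest combined similarity and mutual-information penalty against already selected blocks, weighted by λ2 and λ3) is an assumption, since the equation itself is not reproduced in the text. The names `select_blocks`, `pair_value`, and `budget` are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def pair_value(table: Dict[Pair, float], a: str, b: str) -> float:
    """Look up a symmetric pairwise value; missing pairs count as 0."""
    return table.get((a, b), table.get((b, a), 0.0))


def select_blocks(
    importance: Dict[str, float],
    sim: Dict[Pair, float],
    mi: Dict[Pair, float],
    budget: int,
    lam1: float = 1.0,
    lam2: float = 0.5,
    lam3: float = 0.5,
) -> List[str]:
    """Greedily move blocks from R \\ S into S (assumed scoring form)."""
    selected: List[str] = []           # the set S
    remaining = set(importance)        # the set R \ S
    while remaining and len(selected) < budget:
        def score(c: str) -> float:
            # Repetition penalty against the blocks already in S.
            penalty = max(
                (lam2 * pair_value(sim, c, s) + lam3 * pair_value(mi, c, s)
                 for s in selected),
                default=0.0,
            )
            return lam1 * importance[c] - penalty

        best = max(sorted(remaining), key=score)  # argmax over R \ S
        selected.append(best)
        remaining.remove(best)
    return selected
```

With importances A = 0.9, B = 0.8, E = 0.7 and a high repetition degree between A and B, this sketch picks A first and then E, because B is penalized for repeating A.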
In one application scenario, please understand with reference to fig. 3c, the process of filtering may include 4 steps.
S10. A web page text is segmented into a plurality of text blocks, and a first topological structure is constructed from the text blocks and the association relationships among them. The first topological structure includes a plurality of nodes (node A, node B, node C, ..., node G) and edges between the nodes. Each text block has a corresponding node (e.g., text block A corresponds to node A).
S11. Each text block is input to a language model, and the language model calculates the existence probability of each node (as shown in equation (6) above). Nodes whose existence probability is below the second threshold (e.g., node F) are identified. Node F may be filtered out first.
S12. The importance of each node is calculated according to equation (7). In fig. 3c, the size of a node schematically represents the importance of the text block corresponding to that node.
S13. Target text blocks are screened out from the plurality of text blocks according to equation (8). Among the seven nodes shown in fig. 3c, the repetition degree between text block A and text block B is high, and the repetition degree between text block B and text block C is high; that is, these three nodes may express one topic (topic 1). Because the importance of node A is high, text block A and text block B are selected as target text blocks. Among text block D, text block E, and text block G, the repetition degree between text block D and text block E is high, and the repetition degree between text block E and text block G is high; that is, these three text blocks repeatedly express another topic (topic 2). Because the importance of text block E is high while the importance of text block D and text block G is low, text block E is selected as the target text block from among these three text blocks.
Finally, the electronic device obtains the semantic vector of each target text block as calculated by the depth representation model. The semantic vector of each target text block serves as a retrieval index. The depth representation model includes, but is not limited to, the Bidirectional Encoder Representations from Transformers (BERT) model, the Enhanced Representation through Knowledge Integration (ERNIE) model, the Generative Pre-Training (GPT) model, and the like.
Step 203, the electronic device saves the target text blocks, and the target text blocks are used for retrieving the web page texts.
In this embodiment, when constructing the retrieval index for the web page texts in the web page text library, the electronic device segments each web page text into variable-length blocks with the sentence as the unit, which keeps the semantics expressed by each text block complete. The electronic device then dynamically screens out the target text blocks from the plurality of text blocks according to indexes such as the importance and the repetition degree of the text blocks, and uses the target text blocks as the retrieval index. This reduces the number of retrieval indexes and saves resource overhead in the retrieval process, while preserving the semantic completeness of the target text blocks and improving the quality of the retrieved target web page text.
Steps 201 to 203 are the offline process of screening the target text blocks (which may also be understood as the process of constructing the retrieval index). After the retrieval index has been constructed, steps 201 to 203 need not be executed again; the online web page matching process, i.e., the following steps 204 and 205, is executed directly.
Step 204, the electronic device receives the query statement.
The terminal (such as a mobile phone or a computer) receives a query statement (query) input by a user. The electronic device receives the query statement from the terminal; for example, the query statement is "how to learn C language".
The electronic device calculates a semantic vector of a query sentence by using the depth representation model, and in order to distinguish the semantic vector of the query sentence from the semantic vector of the target text block, the semantic vector of the query sentence is referred to as a "first semantic vector", and the semantic vector of the target text block is referred to as a "second semantic vector".
Step 205, the electronic device matches the first semantic vector with a plurality of second semantic vectors to output a target webpage text matched with the query statement, where the first semantic vector is a semantic vector of the query statement, and the plurality of second semantic vectors are semantic vectors corresponding to a plurality of target text blocks screened from the plurality of webpage texts.
S20. The electronic device determines a plurality of semantic similarities (also called "first semantic similarities"). Each first semantic similarity indicates the similarity between the first semantic vector of the query statement and one of the plurality of second semantic vectors; the plurality of first semantic similarities correspond to different second semantic vectors, and the plurality of second semantic vectors correspond to the plurality of target text blocks obtained by segmenting the plurality of web page texts. For example, the electronic device may match the first semantic vector against the plurality of second semantic vectors by using an approximate nearest neighbor (ANN) algorithm to obtain a candidate text block set. The candidate text block set includes a plurality of target text blocks, and a target text block in the candidate text block set may also be referred to as a "recalled target text block". If the candidate text block set includes 100 target text blocks, the semantics of these 100 target text blocks are similar to the semantics of the query statement. The target text blocks in the candidate text block set are ranked in descending order of semantic similarity to the query statement. Illustratively, the ranking of the plurality of target text blocks is shown in Table 2 below.
TABLE 2
Target text block | Affiliated web page | Semantic similarity | Ranking order
1# target text block a | Web page text 1 (URL1) | 0.87 | 1
2# target text block c | Web page text 2 (URL2) | 0.83 | 2
1# target text block b | Web page text 1 (URL1) | 0.78 | 3
3# target text block a | Web page text 3 (URL3) | 0.75 | 4
2# target text block a | Web page text 2 (URL2) | 0.68 | 5
1# target text block e | Web page text 1 (URL1) | 0.57 | 6
3# target text block b | Web page text 3 (URL3) | 0.45 | 7
Referring to Table 2 above, for convenience of description, target text block a of web page text 1 is denoted as "1# target text block a", target text block c of web page text 2 is denoted as "2# target text block c", target text block b of web page text 1 is denoted as "1# target text block b", and so on. For example, the semantic similarity between the query statement (how to learn C language) and "1# target text block a" is 0.87, the semantic similarity between the query statement and "2# target text block c" is 0.83, and so on. The ranking order indicates the rank of the first semantic similarity corresponding to a target text block; e.g., "1# target text block a" is at position 1, "2# target text block c" is at position 2, and so on.
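The recall-and-rank step of S20 can be sketched as follows. For brevity this uses an exhaustive cosine-similarity scan as a stand-in for the approximate nearest neighbor search named in the text; the function names `cosine` and `recall_blocks`, the block identifiers, and the use of cosine similarity as the similarity measure are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def recall_blocks(
    query_vec: Sequence[float],
    block_vecs: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[int, str, float]]:
    """Return (ranking order, block id, first semantic similarity) tuples."""
    scored = [(bid, cosine(query_vec, vec)) for bid, vec in block_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)  # high to low
    return [(rank, bid, s) for rank, (bid, s) in enumerate(scored[:top_k], start=1)]
```

Each returned tuple corresponds to one row of a ranking table such as Table 2: the block identifier, its similarity to the query, and its ranking order.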
S21. The electronic device outputs the target web page text according to at least one ranking order and at least one relative position. A relative position indicates the distance between the positions of at least two target text blocks in a web page text. The target web page text is the web page text output to the terminal.
The target web page text is output based on a plurality of fourth semantic similarities. Each fourth semantic similarity indicates the similarity between the first semantic vector and one of a plurality of third semantic vectors, where each third semantic vector is the semantic vector of one of the plurality of web page texts and is related to the semantic vectors of the target text blocks in that web page text, the positions of those target text blocks in that web page text, and the ranking orders of those target text blocks.
Illustratively, the electronic device obtains location information for each target text block, wherein the location information is used to determine the relative location. For example, the location information may be a paragraph location, e.g., a first natural segment, a second natural segment, etc. For another example, the position information may be a line position, and a first line position of each target text block is taken as a line position where a target text block is located, or a last line position of each target text block is taken as a line position where a target text block is located. In this embodiment, the position information is described by taking the position of a paragraph as an example.
For example, referring to fig. 4a, the electronic device locates the paragraph position of each recalled target text block in its web page text. In URL1, "1# target text block a" is located in the 1st natural paragraph, "1# target text block b" is located in the 2nd natural paragraph, and "1# target text block e" is located in the 3rd natural paragraph; the relative position of "1# target text block a" and "1# target text block b" is "1", and the relative position of "1# target text block b" and "1# target text block e" is "3". In URL2, "2# target text block c" is located in the 1st natural paragraph and "2# target text block a" is also located in the 1st natural paragraph, so the relative position of "2# target text block c" and "2# target text block a" is "0". In URL3, "3# target text block a" is located in the 5th natural paragraph and "3# target text block b" is located in the 10th natural paragraph, so the relative position of "3# target text block a" and "3# target text block b" is "5".
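The relative positions can be derived from paragraph positions as sketched below, assuming that a relative position is the difference between the paragraph indexes of adjacent recalled target text blocks in the same web page text. The function name `relative_positions` is hypothetical.

```python
from typing import List, Sequence


def relative_positions(paragraphs: Sequence[int]) -> List[int]:
    """Gaps between adjacent recalled blocks of one web page text.

    paragraphs -- natural-paragraph index of each recalled block
    ASSUMPTION: the gap is the difference of adjacent paragraph indexes.
    """
    ordered = sorted(paragraphs)
    return [b - a for a, b in zip(ordered, ordered[1:])]
```

For instance, two blocks that both lie in the 1st natural paragraph give `[0]`, and blocks in the 5th and 10th natural paragraphs give `[5]`, matching the URL2 and URL3 examples.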
Illustratively, referring to fig. 4b, the electronic device includes a document aggregator, which aggregates the target text blocks belonging to the same web page text to obtain a semantic vector (also referred to as the "third semantic vector representation" or "third semantic vector") of each web page text. Illustratively, URL1 corresponds to 3 target text blocks, that is, 3 target text blocks belong to URL1. In the document aggregation process, at the web page text level, URL1 corresponds to 3 semantic vectors; similarly, URL2 corresponds to 2 semantic vectors and URL3 corresponds to 2 semantic vectors. URL2 and URL3 therefore have fewer semantic vectors than URL1. A padding operation (e.g., zero padding) is performed on the semantic vectors, ranking orders, and paragraph positions corresponding to URL2 and URL3, so that the matrix corresponding to each web page text has the same dimensions when the third semantic vector of each web page text is calculated. The document aggregator inputs the ranking order, paragraph position, and semantic vector of each target text block (such as the target text blocks shown in table 1 above) into the coding layer for encoding, so as to obtain a vector representation of the ranking order and paragraph position (also referred to as the "first vector"), and then fuses the first vector corresponding to each target text block with the second semantic vector of that target text block to obtain a second vector of each target text block.
Each second vector is then input into the pooling layer and the neural network layer. With the web page text as the granularity, the neural network layer fuses the 3 second vectors corresponding to each web page text to obtain the semantic vector representation of each web page text, such as the URL1 semantic vector representation, the URL2 semantic vector representation, and the URL3 semantic vector representation.
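The padding and pooling idea can be sketched as follows. This is a deliberately simplified stand-in: the learned coding layer is replaced by appending the ranking order and paragraph position to each block vector, and the pooling and neural network layers are replaced by masked mean pooling. The name `aggregate_page` and the `Block` tuple layout are hypothetical.

```python
from typing import List, Tuple

# (second semantic vector, ranking order, paragraph position) of one recalled block
Block = Tuple[List[float], int, int]


def aggregate_page(blocks: List[Block], max_blocks: int, dim: int) -> List[float]:
    """Aggregate the recalled blocks of one web page text into one vector."""
    # Fuse each block: append ranking order and paragraph position to its vector.
    rows = [list(vec) + [float(rank), float(para)] for vec, rank, para in blocks]
    mask = [1.0] * len(rows)
    # Zero-pad so that every page yields a max_blocks x (dim + 2) matrix.
    while len(rows) < max_blocks:
        rows.append([0.0] * (dim + 2))
        mask.append(0.0)
    real = sum(mask)
    if real == 0:
        return [0.0] * (dim + 2)
    # Masked mean pooling: padded rows do not contribute to the result.
    return [
        sum(row[k] * m for row, m in zip(rows, mask)) / real
        for k in range(dim + 2)
    ]
```

The mask keeps the zero-padded rows from diluting the pooled vector, so a page with 2 recalled blocks and a page with 3 recalled blocks produce vectors of the same length that remain comparable.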
Illustratively, referring to fig. 5, the electronic device calculates the semantic similarity (also referred to as the "fourth semantic similarity") between the first semantic vector of the query statement (query) and the third semantic vector of each web page text. For example, the fourth semantic similarity between the first semantic vector of the query and the third semantic vector of URL1 is 0.73, that for URL2 is 0.75, and that for URL3 is 0.65. The electronic device ranks the recalled web page texts (such as URL1, URL2, and URL3) in descending order of fourth semantic similarity and then outputs a preset number of target web page texts according to the ranking. For example, if the preset number is 10, the electronic device sends the top 10 target web page texts to the terminal. In this embodiment, the number of target text blocks, the number of web page texts, and the preset number shown in table 1 are merely examples given for convenience of description and do not limit this application.
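The final ranking step can be sketched as follows, using the example fourth semantic similarities from the paragraph above. The function name `top_pages` is hypothetical.

```python
from typing import Dict, List


def top_pages(page_scores: Dict[str, float], preset_number: int) -> List[str]:
    """Rank web page texts by fourth semantic similarity, high to low."""
    ranked = sorted(page_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _ in ranked[:preset_number]]


scores = {"URL1": 0.73, "URL2": 0.75, "URL3": 0.65}
print(top_pages(scores, 2))  # ['URL2', 'URL1']
```

Note that URL2 ranks ahead of URL1 at the web page text level even though the single best-matching text block belongs to URL1.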
In this embodiment, when the electronic device performs deep semantic retrieval, the result depends not only on the similarity between each target text block and the query statement but also on the relative positions of at least two target text blocks in the same web page text. The size of a relative position expresses whether the semantics expressed by the at least two target text blocks are compact (or concentrated). The more similar the target text blocks are to the semantics of the query statement and the smaller their relative positions in the web page text, the more likely it is that the topic of the web page text, or of part of its content, is close to the semantics of the query statement, and the more likely the web page text is to satisfy the user's requirement, thereby improving the retrieval quality. For example, among the 3 target text blocks corresponding to URL1, "1# target text block a" has the highest semantic similarity to the query, but the relative positions of the 3 target text blocks corresponding to URL1 are "2" and "4" respectively; that is, the text blocks in URL1 that are semantically similar to the query are not concentrated. By contrast, although the two target text blocks corresponding to URL2 rank 2nd and 5th, both lie in the same natural paragraph and their relative position is "0", so the semantics they express are more concentrated. The topic expressed by URL2 may therefore be closest to the query, and at the web page text level the semantic similarity between URL2 and the query is higher than that between URL1 and the query.
Optionally, referring to fig. 1b again, another embodiment differs from the embodiment corresponding to fig. 5 mainly in that the online process is executed by a different entity. In this embodiment, the online process may be executed by a terminal. The terminal 104 is configured to receive the query statement and then send the query statement to the electronic device 105 (e.g., a server). The electronic device 105 is configured to, after receiving the query statement from the terminal, match the first semantic vector of the query statement against the plurality of second semantic vectors, and then send a candidate text block set matched with the query statement to the terminal, where the candidate text block set includes a plurality of target text blocks, or includes the plurality of second semantic vectors of the plurality of target text blocks. The candidate text block set may further include the first semantic similarity corresponding to each target text block. Optionally, the target text blocks may be ranked according to the first semantic similarity; the first semantic similarities and their ranking are shown in table 2 above. If the candidate text block set includes the target text blocks themselves, the terminal needs to calculate the second semantic vector of each target text block in the candidate text block set by using the depth representation model. To save the computing resources of the terminal, the following description takes as an example the case in which the candidate text block set includes the second semantic vectors of the plurality of target text blocks. Referring to fig. 6, the terminal is configured to perform the following steps 601 to 603.
Step 601, the terminal receives a query statement input by a user.
Please refer to the description of step 204 in the embodiment corresponding to fig. 5, which is not described herein.
Step 602, the terminal receives a plurality of semantic similarities (also referred to as first semantic similarities) from the electronic device.
The terminal receives a candidate text block set from the electronic device, where the candidate text block set includes the second semantic vector of each of a plurality of target text blocks and the first semantic similarity corresponding to each second semantic vector. Each first semantic similarity is the similarity between one second semantic vector and the first semantic vector of the query statement. The first semantic similarities and their ranking are shown in table 2 above; please refer to the description of table 2, which is not repeated here.
Step 603, the terminal outputs at least one target web page text according to at least one ranking order and at least one relative position.
Each target text block also carries corresponding position information, the position information indicates the position of the target text block in the webpage text, and the position information is used for indicating the relative position.
The terminal acquires the position information and a plurality of second semantic vectors of each target text block, wherein the position information indicates the position of the target text block in the webpage text, and the position information is used for indicating the relative position; then determining a third semantic vector of the webpage text by using a document aggregator based on the second semantic vector, the ranking order and the position information of each target text block in a plurality of target text blocks in the same webpage text; then, determining a plurality of fourth semantic similarities, wherein each fourth semantic similarity indicates the similarity of the first semantic vector and one of the third semantic vectors; and finally, outputting at least one target webpage text based on the plurality of fourth semantic similarities.
Please refer to the description of step S21 in the embodiment corresponding to fig. 5, which is not repeated herein.
Referring to fig. 7, an embodiment of the present application provides a web page retrieval apparatus 700, which includes a processing module 701, a receiving module 702, and a storage module 703, and is configured to execute the method executed by the electronic device in the foregoing method embodiment.
The processing module 701 is configured to segment a plurality of web page texts by taking a sentence as a unit to obtain a plurality of text blocks;
the processing module 701 is further configured to screen out a plurality of target text blocks from the plurality of text blocks according to the feature indexes of the plurality of text blocks; the characteristic index is used for indicating the association characteristic among the text blocks or the association characteristic and the semantic characteristic of each text block, wherein the association characteristic is used for indicating the association among semantics;
a storage module 703 is configured to store the multiple target text blocks, where the multiple target text blocks are used to retrieve the multiple web page texts.
Alternatively, the functions of the receiving module 702 may be performed by a transceiver. Wherein the transceiver has a transmitting and/or receiving function. Optionally, the transceiver is replaced by a receiver and/or a transmitter.
Optionally, the receiving module 702 is a network interface. Optionally, the network interface is an input-output interface or a transceiver circuit.
Alternatively, the functions of the receiving module 702 may be performed by a processor.
Alternatively, the processing module 701 is a processor, which is a general-purpose processor, a special-purpose processor, or the like. Optionally, the processor includes a transceiver unit for implementing receiving and transmitting functions. For example, the transceiver unit is a transceiver circuit, an interface, or an interface circuit. The transceiver circuits, interfaces, or interface circuits that implement the receiving and transmitting functions may be deployed separately or, optionally, integrated together. The transceiver circuit, interface, or interface circuit is used for reading and writing code or data, or is used for transmitting or transferring signals.
Further, the processing module 701 is configured to perform step 201, step 202, and step 205 in the method embodiment corresponding to fig. 5. The storage module 703 is configured to execute step 203 in the method embodiment corresponding to fig. 5. The receiving module 702 is configured to execute step 204 in the method embodiment corresponding to fig. 5.
In one implementation, the processing module 701 may be a processing device, and the functions of the processing device may be partially or wholly implemented by software.
Alternatively, the functions of the processing means may be partly or wholly implemented by software. At this time, the processing device may include a memory for storing the computer program and a processor for reading and executing the computer program stored in the memory to perform the corresponding processes and/or steps in any one of the method embodiments.
Alternatively, the processing means may comprise only a processor. The memory for storing the computer program is located outside the processing means and the processor is connected to the memory by means of circuits/wires for reading and executing the computer program stored in the memory.
Alternatively, the processing means may be one or more chips, or one or more integrated circuits.
Optionally, a receiving module 702, configured to receive a query statement;
the processing module 701 is further configured to match a first semantic vector with a plurality of second semantic vectors to output at least one target web text matched with the query statement, where the first semantic vector is a semantic vector of the query statement, the plurality of second semantic vectors are semantic vectors corresponding to the plurality of target text blocks, and the target web text is one of the plurality of web texts.
Optionally, the processing module 701 is further specifically configured to: determining a plurality of first semantic similarities, each first semantic similarity indicating a similarity of a first semantic vector of the query statement and one of a plurality of second semantic vectors, the plurality of first semantic similarities corresponding to different second semantic vectors; outputting at least one target webpage text according to at least one sorting order and at least one relative position; wherein each ranking order is used to indicate a ranking of a corresponding first semantic similarity of one of the plurality of target text blocks, and each relative position is used to indicate a distance of positions of at least two of the plurality of target text blocks in one web page text.
Optionally, the processing module is further specifically configured to: output the target web page text based on a plurality of fourth semantic similarities, each fourth semantic similarity indicating the similarity between the first semantic vector and one of a plurality of third semantic vectors, where each third semantic vector is the semantic vector of one of the plurality of web page texts and is related to the semantic vectors corresponding to the plurality of target text blocks in the corresponding web page text, the positions of the plurality of target text blocks in the corresponding web page text, and the ranking orders of the plurality of target text blocks in the corresponding web page text.
Optionally, the associated features include a degree of repetition; the repetition degree is used for indicating the repetition degree of the two text blocks with the incidence relation expressing the subjects of the webpage texts; the processing module 701 is further configured to screen out a plurality of target text blocks from the plurality of text blocks according to the repetition degree.
Optionally, the semantic features include importance, and the importance is used for indicating the importance degree of the text block for expressing the topic of the webpage text; the processing module 701 is further configured to screen out a plurality of target text blocks from the plurality of text blocks according to the repetition degree and the importance degree.
Optionally, the repetition degree is related to a second semantic similarity indicating a similarity degree of semantics of the two text blocks having an association relationship, and mutual information indicating a degree of interdependency between the two text blocks.
Optionally, the semantic features further include an existence probability, where the existence probability is used to indicate a repetition degree of the text block expressing semantics in the webpage text to which the text block belongs; the processing module 701 is further configured to obtain a hot statement, where the hot statement is a retrieval statement with a retrieval frequency higher than a preset value; the importance of the text block is related to the retrieval frequency of the hot spot sentences, the third semantic similarity of the text block and the hot spot sentences and the existence probability of the text block.
Optionally, the association relationship includes: there is at least one of the same anchor text or entity, there is overlapping textual information, and there is a context structure.
Referring to fig. 8, an embodiment of the present application provides a web page retrieval apparatus 800, which includes a processing module 802 and a receiving module 801, and is configured to execute the method executed by the electronic device in the method embodiment corresponding to fig. 5. Or, the apparatus is configured to execute the method performed by the terminal in the method embodiment corresponding to fig. 6.
A receiving module 801, configured to receive a query statement;
a processing module 802, configured to obtain multiple semantic similarities, where each semantic similarity indicates a similarity between a first semantic vector and one of multiple second semantic vectors, and the multiple first semantic similarities correspond to different second semantic vectors, where the first semantic vector is a semantic vector of the query statement, and the multiple second semantic vectors are semantic vectors corresponding to multiple target text blocks obtained by segmenting multiple web pages; the text block is obtained by segmenting a webpage text by taking a sentence as a unit;
the processing module 802 is further configured to output at least one target web page text according to at least one ranking order and at least one relative position, where each ranking order is used to indicate a ranking of semantic similarity corresponding to one of the target text blocks, and each relative position is used to indicate a distance between positions of at least two of the target text blocks in one web page text, and the target web page text is one of the web page texts.
In a possible design, when the apparatus is configured to perform the method performed by the terminal in the method embodiment corresponding to fig. 6, the receiving module 801 is configured to receive the plurality of semantic similarities and the plurality of second semantic vectors from the electronic device.
Further, the receiving module 801 is configured to perform step 601 and step 602 in the embodiment corresponding to fig. 6, and the processing module 802 is configured to perform step 603 in the embodiment corresponding to fig. 6.
Optionally, the functions of the receiving module 801 may be performed by a transceiver, where the transceiver has a transmitting and/or receiving function. Optionally, the transceiver is replaced by a receiver and/or a transmitter.
Optionally, the receiving module 801 is a network interface. Optionally, the network interface is an input-output interface or a transceiving circuit.
Optionally, the functions of the receiving module 801 may be executed by a processor.
Optionally, the processing module 802 is a processor, which may be a general-purpose processor, a special-purpose processor, or the like. Optionally, the processor comprises a transceiving unit for implementing receiving and transmitting functions. For example, the transceiving unit is a transceiver circuit, an interface, or an interface circuit. The transceiver circuit, interface, or interface circuit used to implement the receiving and transmitting functions may be deployed separately or integrated together. The transceiver circuit, interface, or interface circuit may be used for reading and writing code or data, or for transmitting or transferring signals.
Optionally, the processing module 802 is further specifically configured to: output the target webpage text based on a plurality of fourth semantic similarities, where each fourth semantic similarity indicates a similarity between the first semantic vector and one of a plurality of third semantic vectors. Each third semantic vector is the semantic vector of a corresponding one of the plurality of webpage texts, and corresponds to the plurality of target text blocks in the corresponding webpage text, the positions of the plurality of target text blocks in the corresponding webpage text, and the ranking of the plurality of target text blocks in the corresponding webpage text.
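As a non-limiting illustration, a page-level (third) semantic vector and the resulting fourth semantic similarity might be computed as follows. The embodiment does not specify how block vectors, positions, and rankings are combined; the weighted average below, its weighting formula, and the names `page_vector` and `fourth_similarity` are assumptions made for this sketch.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def page_vector(blocks: List[Tuple[Sequence[float], int, int]]) -> List[float]:
    """Aggregate one webpage text's target text blocks into a single vector.

    blocks holds (block vector, position in the page starting at 0,
    similarity rank starting at 1) tuples.
    """
    if not blocks:
        raise ValueError("a webpage text needs at least one target text block")
    dim = len(blocks[0][0])
    total = [0.0] * dim
    weight_sum = 0.0
    for vec, position, rank in blocks:
        # Assumed weighting: better-ranked and earlier blocks count more.
        weight = (1.0 / rank) * (1.0 / (1 + position))
        weight_sum += weight
        for i, x in enumerate(vec):
            total[i] += weight * x
    return [x / weight_sum for x in total]


def fourth_similarity(
    query_vec: Sequence[float],
    blocks: List[Tuple[Sequence[float], int, int]],
) -> float:
    """Similarity between the query vector and the aggregated page vector."""
    return cosine(query_vec, page_vector(blocks))
```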
Referring to fig. 9, an embodiment of the present application further provides an electronic device 900, where the electronic device is configured to execute the method executed by the electronic device in the method embodiment corresponding to fig. 5. For example, the electronic device may be a server. Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 900 may vary considerably depending on its configuration or performance, and may include one or more processors 922, a memory 932, and one or more readable storage media 930 (e.g., one or more mass storage devices) storing an application 942 or data 944. The memory 932 and the readable storage medium 930 may be transient storage or persistent storage. The program stored on the readable storage medium 930 may include one or more modules (not shown), each of which may include a sequence of instructions for operating on the electronic device. Still further, the processor 922 may be arranged to communicate with the readable storage medium 930 to execute a series of instruction operations in the readable storage medium 930 on the electronic device 900.
The electronic device 900 may also include one or more power sources 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, and/or one or more operating systems 941.
Referring to fig. 10, an embodiment of the present application further provides a terminal 1000, where the terminal is configured to execute the method executed by the terminal in the method embodiment corresponding to fig. 6. The terminal may be a computer or a mobile phone. Terminal 1000 can include components such as a processor 1001, a memory 1002, an input unit 1003, a display unit 1004, and a communication unit 1006. The memory 1002 may be used to store software programs and modules, and the processor 1001 executes various functional applications of the apparatus and performs data processing by running the software programs and modules stored in the memory 1002. The memory 1002 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. The processor 1001 may be the processing device mentioned in the embodiment corresponding to fig. 8. Optionally, the processor 1001 includes, but is not limited to, various types of processors such as a Central Processing Unit (CPU) and an image signal processor.
The input unit 1003 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the device. Specifically, the input unit 1003 may include a touch panel 1031. The touch panel 1031, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 1031 (e.g., operations by a user on or near the touch panel 1031 using any suitable object or attachment such as a finger or a stylus). The input unit 1003 may include other input devices 1032 in addition to the touch panel 1031. In particular, the other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, a mouse, a joystick, or the like. In this embodiment, the input unit 1003 is configured to receive a search statement input by a user.
The display unit 1004 may be used to display various image information. The display unit 1004 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. In some embodiments, the touch panel 1031 may be integrated with the display panel 1041 to implement input and output functions of the device. In the present embodiment, the display unit 1004 is used to display the target webpage text and the like.
The communication unit 1006 is configured to establish a communication channel to communicate with an electronic device (e.g., a server), and to receive from the electronic device the plurality of semantic similarities and the plurality of second semantic vectors of the plurality of target text blocks (or the plurality of target text blocks themselves). For example, the communication unit 1006 may include a wireless local area network module, a Bluetooth module, a baseband module, a Wi-Fi communication module, and other communication modules. Optionally, the communication modules in the communication unit 1006 generally take the form of integrated circuit chips and may be selectively combined, without necessarily including all communication modules and corresponding antenna groups.
In this embodiment, the processor is configured to read the computer program stored in the at least one memory, so that the electronic device executes the method steps executed by the electronic device in the method embodiment corresponding to fig. 5, or the electronic device executes the method steps executed by the terminal in the method embodiment corresponding to fig. 6.
It is understood that the processor in the embodiments of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
An embodiment of the present application further provides a computer-readable storage medium storing a program, where when the program is executed on a computer, the computer is enabled to execute the method performed by the electronic device in the method embodiment corresponding to fig. 5, or to execute the method performed by the terminal in the method embodiment corresponding to fig. 6.
An embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit configured to execute the method performed by the electronic device in the method embodiment corresponding to fig. 5, or to execute the method performed by the terminal in the method embodiment corresponding to fig. 6.
An embodiment of the present application further provides a computer program product which, when executed on a computer, causes the computer to execute the method performed by the electronic device in the method embodiment corresponding to fig. 5, or the method performed by the terminal in the method embodiment corresponding to fig. 6.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that contains one or more collections of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a Solid State Drive (SSD).
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (27)

1. A method for retrieving a web page, comprising:
segmenting a plurality of webpage texts by taking sentences as units to obtain a plurality of text blocks;
screening a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks, wherein the feature indexes are used for indicating association features between the text blocks or semantic features of the association features and the text blocks, and the association features are used for indicating association between semantics of at least two text blocks;
saving the target text blocks, wherein the target text blocks are used for retrieving the webpage texts.
2. The method of claim 1, further comprising:
receiving a query statement;
and matching a first semantic vector with a plurality of second semantic vectors to output at least one target webpage text matched with the query statement, wherein the first semantic vector is a semantic vector of the query statement, the plurality of second semantic vectors are semantic vectors corresponding to the plurality of target text blocks, and the target webpage text is one webpage text in the plurality of webpage texts.
3. The method of claim 2, wherein matching the first semantic vector to a plurality of second semantic vectors to output target web page text that matches the query statement comprises:
determining a plurality of first semantic similarities, each of the first semantic similarities indicating a similarity of the first semantic vector and one of the plurality of second semantic vectors, the plurality of first semantic similarities corresponding to different second semantic vectors;
and outputting at least one target webpage text according to at least one sorting order and at least one relative position, wherein each sorting order is used for indicating the ranking of the corresponding first semantic similarity of one of the target text blocks, and each relative position is used for indicating the distance between the positions of at least two of the target text blocks in one webpage text.
4. The method of claim 3, wherein outputting the target web page text according to at least one ranking order and at least one relative position comprises:
outputting the target webpage text based on a plurality of fourth semantic similarities, each of the fourth semantic similarities indicating a similarity between the first semantic vector and one of a plurality of third semantic vectors, wherein each of the third semantic vectors is the semantic vector of a corresponding one of the plurality of webpage texts, and corresponds to the plurality of target text blocks in the corresponding webpage text, the positions of the plurality of target text blocks in the corresponding webpage text, and the ranking of the plurality of target text blocks in the corresponding webpage text.
5. The method according to claim 1, wherein the associated feature comprises a repetition degree, and the repetition degree is used for indicating a repetition degree of two text blocks having an associated relationship expressing a subject of the webpage text.
6. The method of claim 5, wherein the semantic features comprise an importance indicating how important a role the text block plays in expressing the subject of the webpage text, and wherein the screening of the target text blocks from the text blocks according to the feature indexes comprises:
and screening the target text blocks from the text blocks according to the repetition degree and the importance degree.
7. The method of claim 6, wherein the repetition degree is related to a second semantic similarity and mutual information between the two text blocks, the second semantic similarity indicates a similarity degree of semantics of the two text blocks having an association relationship, and the mutual information indicates a degree of interdependence between the two text blocks.
8. The method according to claim 6 or 7, wherein the semantic features further comprise an existence probability indicating a repetition degree of the text block expressing semantics in the webpage text, and the method further comprises:
acquiring a hot sentence, wherein the hot sentence is a retrieval sentence with retrieval frequency higher than a preset value;
the importance of the text block is related to the retrieval frequency of the hot spot sentence, the third semantic similarity of the text block and the hot spot sentence, and the existence probability of the text block.
9. The method according to any one of claims 6-8, wherein the association relationship comprises at least one of: having the same anchor text or entity, having overlapping text information, and having a contextual structure.
10. A method for retrieving a web page, comprising:
receiving a query statement;
obtaining a plurality of semantic similarities, each semantic similarity indicating a similarity between a first semantic vector and one of a plurality of second semantic vectors, wherein the plurality of semantic similarities correspond to different second semantic vectors, the first semantic vector is a semantic vector of the query statement, the plurality of second semantic vectors correspond to a plurality of target text blocks segmented from a plurality of webpage texts, and each target text block indicates at least one sentence in the webpage text;
and outputting at least one target webpage text according to at least one sorting order and at least one relative position, wherein each sorting order is used for indicating the ranking of the semantic similarity corresponding to one of the target text blocks, each relative position is used for indicating the distance between the positions of at least two of the target text blocks in one webpage text, and the target webpage text is one webpage text in the plurality of webpage texts.
11. The method of claim 10, wherein outputting at least one target web page text according to at least one ranking order and at least one relative position comprises:
outputting the target webpage text based on a plurality of fourth semantic similarities, each of the fourth semantic similarities indicating a similarity between the first semantic vector and one of a plurality of third semantic vectors, wherein each of the third semantic vectors is the semantic vector of a corresponding one of the plurality of webpage texts, and corresponds to the plurality of target text blocks in the corresponding webpage text, the positions of the plurality of target text blocks in the corresponding webpage text, and the ranking of the plurality of target text blocks in the corresponding webpage text.
12. The method of claim 10, further comprising: segmenting a plurality of webpage texts by taking sentences as units to obtain a plurality of text blocks;
screening a plurality of target text blocks from a plurality of text blocks according to feature indexes of the text blocks, wherein the feature indexes are used for indicating association features between the text blocks or semantic features of the association features and the text blocks, and the association features are used for indicating association between semantics of at least two text blocks;
saving the target text blocks, wherein the target text blocks are used for retrieving the webpage texts.
13. A web page retrieval apparatus, comprising:
the processing module is used for segmenting the plurality of webpage texts by taking sentences as units to obtain a plurality of text blocks;
the processing module is further configured to screen a plurality of target text blocks from the plurality of text blocks according to the feature indexes of the plurality of text blocks; the feature indicator is used for indicating an association feature between the plurality of text blocks or a semantic feature of each of the plurality of text blocks, wherein the association feature is used for indicating an association between semantics of at least two text blocks;
and the storage module is used for storing the target text blocks, and the target text blocks are used for retrieving the webpage texts.
14. The apparatus of claim 13, further comprising:
a receiving module, configured to receive a query statement;
the processing module is further configured to match the first semantic vector with a plurality of second semantic vectors to output at least one target web page text matched with the query statement, where the first semantic vector is a semantic vector of the query statement, the second semantic vectors are semantic vectors corresponding to the target text blocks, and the target web page text is one of the web page texts.
15. The apparatus of claim 13, wherein the processing module is further specifically configured to:
determining a plurality of first semantic similarities, each of the first semantic similarities indicating a similarity of a first semantic vector and one of the plurality of second semantic vectors, the plurality of first semantic similarities corresponding to different second semantic vectors;
outputting at least one of the target web page texts according to at least one ranking order and at least one relative position, wherein each ranking order is used for indicating the ranking of the corresponding first semantic similarity of one of the target text blocks, and each relative position is used for indicating the distance of the positions of at least two of the target text blocks in one web page text.
16. The apparatus of claim 15, wherein the processing module is further specifically configured to:
outputting the target webpage text based on a plurality of fourth semantic similarities, each of the fourth semantic similarities indicating a similarity between the first semantic vector and one of a plurality of third semantic vectors, wherein each of the third semantic vectors is the semantic vector of a corresponding one of the plurality of webpage texts, and corresponds to the plurality of target text blocks in the corresponding webpage text, the positions of the plurality of target text blocks in the corresponding webpage text, and the ranking of the plurality of target text blocks in the corresponding webpage text.
17. The apparatus of claim 13, wherein the associated feature comprises a repetition degree indicating a repetition degree of two text blocks having an associated relationship to express a subject of the webpage text.
18. The apparatus of claim 17, wherein the semantic features comprise an importance indicating how important a role the text block plays in expressing the subject of the webpage text;
the processing module is further configured to screen the target text blocks from the text blocks according to the repetition degree and the importance degree.
19. The apparatus of claim 18, wherein the repetition degree is related to a second semantic similarity and mutual information between the two text blocks, the second semantic similarity indicates a similarity degree of semantics of the two text blocks having an association relationship, and the mutual information indicates a degree of interdependence between the two text blocks.
20. The apparatus according to claim 18 or 19, wherein the semantic features further comprise an existence probability indicating a repetition degree of the text block expressing semantics in the webpage text;
the processing module is further used for acquiring a hot spot statement, wherein the hot spot statement is a retrieval statement with retrieval frequency higher than a preset value;
the importance of the text block is related to the retrieval frequency of the hot spot statement, the third semantic similarity of the text block and the hot spot statement, and the existence probability of the text block.
21. The apparatus according to any of claims 18-20, wherein the association relationship comprises at least one of: having the same anchor text or entity, having overlapping text information, and having a contextual structure.
22. A web page retrieval apparatus, comprising:
a receiving module, configured to receive a query statement;
a processing module, configured to obtain a plurality of semantic similarities, where each semantic similarity indicates a similarity between a first semantic vector and one of a plurality of second semantic vectors, and the plurality of semantic similarities correspond to different second semantic vectors, where the first semantic vector is a semantic vector of the query statement, the plurality of second semantic vectors correspond to a plurality of target text blocks segmented from a plurality of webpage texts, and each target text block indicates at least one sentence in the webpage text;
the processing module is further configured to output at least one target web page text according to at least one ranking order and at least one relative position, where each ranking order is used to indicate a ranking of semantic similarity corresponding to one of the target text blocks, each relative position is used to indicate a distance between positions of at least two of the target text blocks in one web page text, and the target web page text is one of the web page texts.
23. The apparatus of claim 22, wherein the processing module is further specifically configured to: output the target webpage text based on a plurality of fourth semantic similarities, each of the fourth semantic similarities indicating a similarity between the first semantic vector and one of a plurality of third semantic vectors, wherein each of the third semantic vectors is the semantic vector of a corresponding one of the plurality of webpage texts, and corresponds to the plurality of target text blocks in the corresponding webpage text, the positions of the plurality of target text blocks in the corresponding webpage text, and the ranking of the plurality of target text blocks in the corresponding webpage text.
24. The apparatus of claim 22, wherein the processing module is further specifically configured to:
segmenting a plurality of webpage texts by taking sentences as units to obtain a plurality of text blocks;
and screening a plurality of target text blocks from the plurality of text blocks according to feature indexes of the plurality of text blocks, wherein the feature indexes are used for indicating association features between the text blocks or semantic features of the association features and the text blocks, and the association features are used for indicating association between semantics of at least two text blocks.
25. An electronic device, comprising a processor coupled with at least one memory, the processor to read a computer program stored by the at least one memory, cause the electronic device to perform the method of any of claims 1-9, or cause the electronic device to perform the method of any of claims 10-12.
26. A computer-readable storage medium storing a computer program or instructions which, when executed, cause a computer to perform the method of any of claims 1 to 9 or cause an electronic device to perform the method of any of claims 10 to 12.
27. A computer program product comprising instructions which, when executed by a computer, cause the computer to carry out the method of any one of claims 1 to 9 or cause an electronic device to carry out the method of any one of claims 10 to 12.
CN202111101017.2A 2021-09-18 2021-09-18 Webpage retrieval method and related equipment Pending CN115840845A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111101017.2A CN115840845A (en) 2021-09-18 2021-09-18 Webpage retrieval method and related equipment
PCT/CN2022/118346 WO2023040808A1 (en) 2021-09-18 2022-09-13 Webpage retrieval method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101017.2A CN115840845A (en) 2021-09-18 2021-09-18 Webpage retrieval method and related equipment

Publications (1)

Publication Number Publication Date
CN115840845A true CN115840845A (en) 2023-03-24

Family

ID=85574270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101017.2A Pending CN115840845A (en) 2021-09-18 2021-09-18 Webpage retrieval method and related equipment

Country Status (2)

Country Link
CN (1) CN115840845A (en)
WO (1) WO2023040808A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923556B (en) * 2010-02-09 2013-01-02 上海莱希信息科技有限公司 Method and device for searching webpages according to sentence serial numbers
CN103020164B (en) * 2012-11-26 2015-06-10 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN104484379B (en) * 2014-12-09 2018-06-12 百度在线网络技术(北京)有限公司 Determine the method and apparatus of music property relationship and inquiry processing method and device
CN107256267B (en) * 2017-06-19 2020-07-24 北京百度网讯科技有限公司 Query method and device
CN108829719B (en) * 2018-05-07 2022-03-01 中国科学院合肥物质科学研究院 Non-fact question-answer selection method and system
CN109033166B (en) * 2018-06-20 2022-01-07 国家计算机网络与信息安全管理中心 Character attribute extraction training data set construction method

Also Published As

Publication number Publication date
WO2023040808A1 (en) 2023-03-23

Legal Events

Date Code Title Description
PB01 Publication