CN114818727A - Key sentence extraction method and device


Info

Publication number: CN114818727A
Authority: CN (China)
Prior art keywords: key, target, key sentence, document, sentences
Legal status: Pending
Application number: CN202210412327.4A
Other languages: Chinese (zh)
Inventors: 王得贤, 李长亮
Current Assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee: Beijing Kingsoft Digital Entertainment Co Ltd
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking


Abstract

The application provides a key sentence extraction method and a key sentence extraction device, wherein the key sentence extraction method comprises the following steps: acquiring a target document, and extracting a keyword and a first key sentence set based on the text content of the target document; extracting first semantic features of the keywords and second semantic features of text sentences in the target document, and determining a second key sentence set according to the first semantic features and the second semantic features; and determining a target key sentence set according to the first key sentence set and the second key sentence set. The method can effectively improve the accuracy and efficiency of extracting the key sentences.

Description

Key sentence extraction method and device
Technical Field
The application relates to the technical field of natural language processing, in particular to a key sentence extraction method. The application also relates to a key sentence extracting device, a computing device and a computer readable storage medium.
Background
With the development of artificial intelligence in the field of computer technology, the field of natural language processing has also developed rapidly, and information retrieval based on text is an important branch of it. Artificial Intelligence (AI) refers to the ability of an engineered (i.e., designed and manufactured) system to perceive its environment and to acquire, process, apply, and represent knowledge. Key technologies in the field of artificial intelligence include machine learning, knowledge graphs, natural language processing, computer vision, human-computer interaction, biometric recognition, virtual reality/augmented reality, and the like. Natural Language Processing (NLP) is an important research direction in computer science that studies theories and methods enabling effective communication between humans and computers in natural language. Typical forms of natural language processing include machine translation, text summarization, text classification, text proofreading, information extraction, speech synthesis, speech recognition, and so on. With the development of natural language processing technology and the accelerating pace of life, the effective information to be delivered to a user needs to be ever more concise; a key sentence extraction technique in natural language processing can therefore be used to extract key sentences and condense that information.
A long document contains a large number of sentences, and the number grows as the document grows, which makes key sentences difficult to locate and the key sentences that are determined inaccurate. To ensure the accuracy of the key sentences, key sentences in articles currently have to be searched for manually. However, this approach is very inefficient and consumes a large amount of manpower and material resources. An effective solution to the above problems is therefore needed.
Disclosure of Invention
In view of this, the present application provides a method for extracting key sentences to solve the technical defects in the prior art. The embodiment of the application also provides a key sentence extracting device, a computing device and a computer readable storage medium.
According to a first aspect of an embodiment of the present application, there is provided a key sentence extraction method, including:
acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document;
extracting first semantic features of the keywords and second semantic features of text sentences in the target document, and determining a second key sentence set according to the first semantic features and the second semantic features;
and determining a target key sentence set according to the first key sentence set and the second key sentence set.
According to a second aspect of the embodiments of the present application, there is provided a key sentence extraction apparatus, including:
the first acquisition module is configured to acquire a target document and extract keywords and a first key sentence set based on the text content of the target document;
the first determining module is configured to extract first semantic features of the keywords and second semantic features of text sentences in the target document, and determine a second key sentence set according to the first semantic features and the second semantic features;
and the second determining module is configured to determine the target key sentence set according to the first key sentence set and the second key sentence set.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the steps of the key sentence extraction method are realized when the processor executes the computer-executable instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of a key sentence extraction method.
According to a fifth aspect of embodiments of the present application, there is provided a chip storing a computer program that, when executed by the chip, implements the steps of the key sentence extraction method.
According to the key sentence extraction method provided herein, a target document is acquired, and keywords and a first key sentence set are extracted based on the text content of the target document; first semantic features of the keywords and second semantic features of the text sentences in the target document are extracted, and a second key sentence set is determined according to the first semantic features and the second semantic features; and a target key sentence set is determined according to the first key sentence set and the second key sentence set. The first key sentence set is determined from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined from the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information. The target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the key sentence extraction method provided by the application realizes automatic extraction of key sentences; while the accuracy of the key sentences is ensured, the large expenditure of manpower and material resources on extracting key sentences is avoided, the efficiency of key sentence extraction is improved, and its cost is reduced.
Drawings
Fig. 1 is a schematic structural diagram of a key sentence extraction method according to an embodiment of the present application;
fig. 2 is a flowchart of a key sentence extraction method according to an embodiment of the present application;
fig. 3A is a schematic structural diagram of a similarity analysis model in a key sentence extraction method according to an embodiment of the present application;
fig. 3B is a flowchart illustrating a process of determining text similarity in a method for extracting key sentences according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating a key sentence extraction method applied to a document recall according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a key sentence extracting apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar modifications without departing from the spirit of the application; the application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the terms involved in one or more embodiments of the present application are explained.
The TextRank algorithm is a keyword extraction algorithm commonly used in the field of natural language processing; it can be used to extract keywords, phrases and key sentences and to generate text summaries automatically, and it is a graph-based ranking algorithm.
The TF-IDF (term frequency-inverse document frequency) algorithm is a weighting technique commonly used in information retrieval and data mining; TF stands for term frequency and IDF for inverse document frequency.
The LDA (Latent Dirichlet Allocation) algorithm is a computation method for topic models and has no direct relation to word vectors.
The Latent Semantic Analysis (LSA) algorithm is mainly used for topic extraction from texts, mining the meaning behind texts, data dimensionality reduction, and the like.
In the present application, a key sentence extraction method is provided. The present application also relates to a key sentence extracting apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
The execution subject of the key sentence extraction method provided by the embodiments of the present application may be a server or a terminal, which is not limited in the embodiments of the present application. The terminal may be any electronic product capable of human-computer interaction with a user, such as a personal computer (PC), a mobile phone, a pocket PC, a tablet computer, and the like. The server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center, which is not limited in the embodiments of the present application.
Referring to a schematic structural diagram of a key sentence extraction method shown in fig. 1, a target document is obtained first; then extracting keywords and a first key sentence set of the target document based on the text content of the target document; then, determining first semantic features of the keywords and second semantic features of text sentences in the target document; further, a second key sentence set is determined according to the first semantic features and the second semantic features; and finally, determining a target key sentence set of the target document according to the first key sentence set and the second key sentence set.
According to the key sentence extraction method provided herein, a target document is acquired, and keywords and a first key sentence set are extracted based on the text content of the target document; first semantic features of the keywords and second semantic features of the text sentences in the target document are extracted, and a second key sentence set is determined according to the first semantic features and the second semantic features; and a target key sentence set is determined according to the first key sentence set and the second key sentence set. The first key sentence set is determined from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined from the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information. The target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the key sentence extraction method provided by the application realizes automatic extraction of key sentences; while the accuracy of the key sentences is ensured, the large expenditure of manpower and material resources on extracting key sentences is avoided, the efficiency of key sentence extraction is improved, and its cost is reduced.
Fig. 2 is a flowchart of a method for extracting a key sentence according to an embodiment of the present application, which specifically includes the following steps:
step 202: and acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document.
It should be noted that, for texts in different fields or of different categories, such as texts in the medical field, texts in the astronomy field, long texts and short texts, the process of extracting key sentences with the key sentence extraction method and apparatus provided herein is essentially the same; this process is described in detail below.
Specifically, text refers to the written form of language, usually a sentence or a combination of sentences with complete, systematic meaning; a text may be a sentence, a paragraph or a chapter. A document refers to a file containing text; the target document refers to the document whose keywords and key sentences are to be extracted; the text content refers to the textual content of the document; keywords are words used to express the subject matter of the document; key sentences are sentences used to express the subject matter of the document; a key sentence set refers to a collection of one or more key sentences; and the first key sentence set is the key sentence set obtained at the text level.
In practical applications, the target document may be acquired in various ways. For example, an operator may send an instruction for extracting key sentences, or an instruction for acquiring a target document, to the execution subject, and the execution subject starts to acquire the target document after receiving the instruction. Alternatively, the target document may be acquired automatically at preset intervals; for example, a server with the key sentence extraction function automatically acquires the target document after a preset time elapses, or a terminal with the key sentence extraction function automatically acquires the target document after a preset time elapses. The manner of acquiring the target document is not limited in this specification.
In addition, the target document may be in any format, such as the doc (document) format, the txt format, an image format, or the pdf (Portable Document Format) format, which is not limited in this specification.
After the target document is obtained, the text content of the target document can be extracted: a corresponding text box extraction tool is selected according to the format of the target document, and text boxes are extracted from the target document by that tool, where the text boxes contain the characters or text that make up the text content. Selecting a text box extraction tool that matches the format of the target document improves both the accuracy and the speed of extracting the text content.
For example, if the obtained target document is in PDF format, a PDF parsing tool corresponding to that format is selected and applied to the target document, so that at least one text box containing text is extracted from the target document and the text content of the target document is obtained. If the obtained target document is in an image format, an Optical Character Recognition (OCR) tool corresponding to the image format is selected to perform the extraction, so that at least one text box containing text is extracted and the text content of the target document is obtained.
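By way of illustration only, the following sketch shows what this format-dependent extraction step could look like, assuming the pdfminer.six library as the PDF parsing tool and pytesseract as the OCR tool; these library choices and the helper name extract_document_text are assumptions for illustration and are not prescribed by the present application.

    import os
    from pdfminer.high_level import extract_text   # PDF parsing (assumed tool choice)
    from PIL import Image
    import pytesseract                              # OCR engine wrapper (assumed tool choice)

    def extract_document_text(path):
        """Pick a text extraction tool according to the format of the target document."""
        ext = os.path.splitext(path)[1].lower()
        if ext == ".pdf":
            # PDF format: a PDF parsing tool pulls the text out of the document's text boxes
            return extract_text(path)
        if ext in (".png", ".jpg", ".jpeg"):
            # Image format: an OCR tool recognizes the characters in the image
            return pytesseract.image_to_string(Image.open(path), lang="chi_sim")
        # Plain-text formats such as txt can simply be read
        with open(path, encoding="utf-8") as f:
            return f.read()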
In one or more optional embodiments of the present specification, after the text content of the target document is obtained, keywords and at least one key sentence may be extracted directly from the text content by a preset key sentence extraction tool; the obtained keywords are determined as the keywords of the target document, and the obtained at least one key sentence is determined as the first key sentence set of the target document. In this way, the efficiency of extracting the keywords and the first key sentence set can be improved.
Step 204: and extracting first semantic features of the keywords and second semantic features of text sentences in the target document, and determining a second key sentence set according to the first semantic features and the second semantic features.
On the basis of obtaining the keywords and the first key sentence set based on the word content of the target document, further, determining a second key sentence set of the target document according to the first semantic features of the keywords and the second semantic features of the text sentences.
Specifically, the semantic features refer to features corresponding to the linguistic meanings of a plurality of word units; the first semantic features refer to the semantic features of the keywords; the second semantic features refer to semantic features of the text sentences; the second key sentence set refers to a key sentence set acquired from a semantic level.
In one or more optional embodiments of the present specification, after the keywords are obtained, a preset semantic feature extraction tool may be used to extract the first semantic features of the keywords and then the second semantic features of each text sentence in the target document, which can improve the accuracy of extracting the first semantic features and the second semantic features.
Further, the first semantic features of the keywords are compared with the second semantic features of the text sentences, and a second key sentence set of the target document is determined from the text sentences.
Step 206: and determining a target key sentence set according to the first key sentence set and the second key sentence set.
And further, determining a target key sentence set of the target document according to the first key sentence set and the second key sentence set on the basis of determining the first key sentence set and the second key sentence set.
Specifically, the target key sentence set refers to a finally determined key sentence set of the target document, that is, a key sentence set including the target key sentence.
In a possible implementation of the embodiments of the present specification, after the first key sentence set and the second key sentence set are obtained, the union of the two sets may be taken to obtain the key sentence set of the target document, that is, the key sentences contained in the first key sentence set and the second key sentence set are merged into one set to obtain the target key sentence set of the target document. This ensures the completeness of the target key sentence set and improves both the completeness and the accuracy of key sentence extraction.
For example, the first key sentence set includes key sentence 1, key sentence 2, key sentence 3, and key sentence 4, the second key sentence set includes key sentence 2, key sentence 4, key sentence 5, and key sentence 6, and the target key sentence set includes key sentence 1, key sentence 2, key sentence 3, key sentence 4, key sentence 5, and key sentence 6.
In another possible implementation of the embodiments of the present specification, after the first key sentence set and the second key sentence set are obtained, the intersection of the two sets may be taken to obtain the key sentence set of the target document, that is, the key sentences contained in both the first key sentence set and the second key sentence set constitute the target key sentence set of the target document. This ensures the accuracy of the target key sentence set and guarantees that the extracted key sentences carry key information at both the text level and the semantic level.
Along the above example, the first key sentence set includes key sentence 1, key sentence 2, key sentence 3, and key sentence 4, the second key sentence set includes key sentence 2, key sentence 4, key sentence 5, and key sentence 6, and the target key sentence set includes key sentence 2 and key sentence 4.
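A minimal sketch of the two combination strategies described above, assuming each key sentence set is represented as a Python set of sentence strings (the variable names are illustrative only):

    # Key sentence sets from the examples above, represented as sets of sentence strings
    first_key_sentence_set = {"key sentence 1", "key sentence 2", "key sentence 3", "key sentence 4"}
    second_key_sentence_set = {"key sentence 2", "key sentence 4", "key sentence 5", "key sentence 6"}

    # Union: keep every key sentence found at either the text level or the semantic level
    target_union = first_key_sentence_set | second_key_sentence_set         # key sentences 1 to 6

    # Intersection: keep only key sentences confirmed at both levels
    target_intersection = first_key_sentence_set & second_key_sentence_set  # key sentences 2 and 4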
According to the key sentence extraction method provided herein, a target document is acquired, and keywords and a first key sentence set are extracted based on the text content of the target document; first semantic features of the keywords and second semantic features of the text sentences in the target document are extracted, and a second key sentence set is determined according to the first semantic features and the second semantic features; and a target key sentence set is determined according to the first key sentence set and the second key sentence set. The first key sentence set is determined from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined from the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information. The target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the key sentence extraction method provided by the application realizes automatic extraction of key sentences; while the accuracy of the key sentences is ensured, the large expenditure of manpower and material resources on extracting key sentences is avoided, the efficiency of key sentence extraction is improved, and its cost is reduced.
In one or more optional embodiments of the present specification, the keywords and a third key sentence set of the target document may be extracted directly by a preset key sentence extraction tool, and a fourth key sentence set of the target document may be determined according to the keywords and the target document, where the third key sentence set and the fourth key sentence set together constitute the first key sentence set. That is, in the case where the first key sentence set includes the third key sentence set and the fourth key sentence set, the keywords and the first key sentence set are extracted based on the text content of the target document, and a specific implementation may be as follows:
extracting keywords and a third key sentence set of the target document by utilizing an extraction algorithm based on the text content according to the text content of the target document;
and identifying a target text sentence containing the keyword in the target document according to the keyword, and constructing a fourth key sentence set of the target document based on the target text sentence.
Specifically, the third key sentence set is a key sentence set directly obtained from the text content based on the text content extraction algorithm; the fourth key sentence set is a key sentence set composed of key sentences containing keywords.
In practical applications, after the text content of the target document is acquired, the keywords and at least one key sentence are extracted directly from the text content by the preset key sentence extraction tool; the obtained keywords are determined as the keywords of the target document, and the obtained at least one key sentence is determined as the third key sentence set of the target document. Then, for each text sentence in the target document, it is checked whether the text sentence contains a keyword; if so, the text sentence is determined as a target text sentence, and if not, as a non-target text sentence. Each text sentence in the target document is traversed in this way, and all the target text sentences obtained form the fourth key sentence set. This improves the completeness of the first key sentence set and in turn the efficiency of determining the target key sentence set.
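The construction of the fourth key sentence set can be sketched as follows; the sentence-splitting rule and the helper name build_fourth_key_sentence_set are assumptions for illustration:

    import re

    def build_fourth_key_sentence_set(document_text, keywords):
        """Collect every text sentence of the target document that contains at least one keyword."""
        # Split the text content into sentences on common sentence terminators (illustrative rule)
        sentences = [s.strip() for s in re.split(r"[。！？.!?\n]", document_text) if s.strip()]
        fourth_set = []
        for sentence in sentences:                        # traverse each text sentence
            if any(keyword in sentence for keyword in keywords):
                fourth_set.append(sentence)               # target text sentence: contains a keyword
        return fourth_set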
In one or more optional embodiments of the present specification, the keywords of the target document may be extracted by a term frequency-inverse document frequency (TF-IDF) extraction algorithm. The keywords of the target document may also be extracted by a graph-based ranking algorithm; that is, the keywords of the target document are extracted from its text content by an extraction algorithm based on the text content, and a specific implementation may be as follows:
performing word segmentation and stop-word removal on the text content of the target document to obtain a plurality of candidate words;
constructing a word graph by taking each candidate word as a node and taking a co-occurrence relation among the candidate words as an edge according to a preset sliding window;
according to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached to obtain a first target weight corresponding to each candidate word;
determining a keyword of the target document from each candidate word based on the first target weight.
Specifically, word segmentation refers to the segmentation of the character strings in the text content into words, which may use a forward maximum matching method, a reverse maximum matching method, a shortest-path segmentation method or a bidirectional maximum matching method, which is not limited in the present application; stop-word removal means deleting or filtering the stop words among the segmented words; the candidate words are the words obtained after word segmentation and stop-word removal have been performed on the text content; a co-occurrence relationship means that two candidate words appear together within the same sliding window; the word graph is a graph whose nodes are the words, that is, the candidate words; the first initial weight refers to the weight of a candidate word determined from the word graph; and the first target weight refers to a first initial weight that has become stable, i.e., converged.
In practical application, word segmentation processing can be directly carried out on the text content of the target document to obtain a plurality of words; the text content can be segmented according to the whole sentence to obtain a plurality of sentences, and then each sentence is segmented to obtain a plurality of words.
Further, stop-word removal is performed on the obtained words, that is, the stop words among them are removed, to obtain a plurality of candidate words. For example, part-of-speech tagging is performed on each word to determine its part of speech, and then the function words among the words, i.e., the stop words, are deleted according to their parts of speech, so that a plurality of candidate words is obtained.
Then a word graph is constructed from the candidate words: each candidate word is taken as a node, and the co-occurrence relationship between candidate words is taken as an edge between two nodes; an edge exists between two nodes only if the corresponding candidate words co-occur within a preset sliding window of length K, that is, at most K candidate words co-occur, where K is a positive integer, for example K = 2. In this way the word graph is constructed.
Then, according to the connection relationships between the nodes in the word graph and a preset weight calculation formula, shown as formula 1, the first initial weight of each candidate word is calculated iteratively until a first preset convergence condition is reached, for example the first initial weights become stable; the stabilized first initial weight is then determined as the first target weight of the corresponding candidate word.
S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ S(V_j) / |Out(V_j)| ]    (formula 1)
In formula 1, V_i denotes the i-th candidate word; V_j denotes the j-th candidate word; S(V_i) denotes the first initial weight of the i-th candidate word; S(V_j) denotes the first initial weight of the j-th candidate word; d denotes a damping coefficient, e.g. 0.85; In(V_i) denotes the set of candidate words pointing to the i-th candidate word; Out(V_j) denotes the set of candidate words pointed to by the j-th candidate word; and |Out(V_j)| denotes the number of elements of Out(V_j).
On the basis of the determined first target weights, the keywords of the target document may be determined from the candidate words whose first target weight is greater than a first weight threshold; alternatively, the candidate words may be sorted in descending order of first target weight, and the top-N candidate words determined as the keywords of the target document, where N is a preset positive integer. In this way, both the efficiency and the accuracy of keyword extraction can be improved.
For example, with the TextRank algorithm, the text content may first be split into whole sentences to obtain a plurality of sentences, and each sentence is then segmented into words; part-of-speech tagging is performed on each word, and the stop words are deleted according to the parts of speech to obtain a plurality of candidate words. A word graph is then constructed with the candidate words as nodes and the co-occurrence relationships between them as edges, using a preset sliding window of size 2. The first initial weight of each candidate word is then calculated with formula 1 until the first preset convergence condition is reached, yielding the first target weight of each candidate word. Finally, the 5 candidate words with the largest first target weights are determined as the keywords of the target document.
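Purely as an illustrative sketch of this keyword step, the following code applies formula 1 to a list of candidate words that is assumed to be already segmented and stop-word filtered; the window size, damping coefficient, convergence tolerance and top-N value follow the example values given above, and the function name is hypothetical.

    from collections import defaultdict

    def textrank_keywords(candidate_words, window=2, d=0.85, tol=1e-6, max_iter=100, top_n=5):
        """Build the co-occurrence word graph and iterate formula 1 until the weights stabilize."""
        # Edges: candidate words that co-occur inside a sliding window of length `window`
        neighbors = defaultdict(set)
        for i, word in enumerate(candidate_words):
            for j in range(i + 1, min(i + window, len(candidate_words))):
                other = candidate_words[j]
                if other != word:
                    neighbors[word].add(other)
                    neighbors[other].add(word)            # undirected graph: In(V) equals Out(V)
        if not neighbors:
            return []

        weights = {w: 1.0 for w in neighbors}             # first initial weights
        for _ in range(max_iter):
            new_weights = {}
            for w in neighbors:
                # Formula 1: S(Vi) = (1 - d) + d * sum over Vj in In(Vi) of S(Vj) / |Out(Vj)|
                rank_sum = sum(weights[v] / len(neighbors[v]) for v in neighbors[w])
                new_weights[w] = (1 - d) + d * rank_sum
            converged = max(abs(new_weights[w] - weights[w]) for w in neighbors) < tol
            weights = new_weights
            if converged:                                 # first preset convergence condition
                break

        # First target weights: the top-N candidate words are taken as the keywords
        ranked = sorted(weights, key=weights.get, reverse=True)
        return ranked[:top_n]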
In one or more optional embodiments of the present specification, the key sentences of the target document may be obtained according to the semantic relevance between the sentences and the title of the target document; the third key sentence set may also be obtained by extracting the key sentences of the target document with a graph-based ranking algorithm, that is, the third key sentence set is extracted from the text content of the target document by an extraction algorithm based on the text content, and a specific implementation may be as follows:
performing sentence division processing on the text content of the target document to obtain a plurality of candidate sentences;
constructing a sentence graph by taking the candidate sentences as nodes and the sentence similarity among the candidate sentences as edges;
according to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, to obtain a second target weight corresponding to each candidate sentence;
a third set of key sentences of the target document is determined from the candidate sentences based on the second target weight.
Specifically, sentence segmentation refers to a process of segmenting sentences in the text content; the candidate sentences are all sentences obtained after sentence division processing is carried out on the text content; sentence similarity refers to the similarity of sentence semantics; the sentence graph is a graph formed by taking sentences, namely candidate sentences as nodes; the second initial weight refers to the weight of the candidate sentence determined based on the sentence graph; the second target weight refers to a second initial weight that tends to stabilize or converge.
In practical applications, the text content is first split into whole sentences, that is, sentence division is performed, to obtain a plurality of candidate sentences. A sentence graph is then constructed from the candidate sentences: each candidate sentence is taken as a node, and edges are constructed between nodes according to the sentence similarity between candidate sentences, yielding the sentence graph. Then, according to the connection relationships between the nodes in the sentence graph and a preset weight calculation formula, shown as formula 2, the second initial weight of each candidate sentence is calculated iteratively until a second preset convergence condition is reached, for example the second initial weights become stable; the stabilized second initial weight is determined as the second target weight of the corresponding candidate sentence. Further, the candidate sentences whose second target weight is greater than a second weight threshold may be determined as key sentences of the target document; alternatively, the candidate sentences may be sorted in descending order of second target weight, and the top-M candidate sentences determined as key sentences of the target document, where M is a preset positive integer. All the key sentences obtained form the third key sentence set of the target document.
Therefore, the efficiency and the accuracy of extracting the third key sentence set can be improved.
WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ w_ji / ( Σ_{V_k ∈ Out(V_j)} w_jk ) ] * WS(V_j)    (formula 2)
In formula 2, V_i denotes the i-th candidate sentence; V_j denotes the j-th candidate sentence; WS(V_i) denotes the second initial weight of the i-th candidate sentence; WS(V_j) denotes the second initial weight of the j-th candidate sentence from the previous iteration; d denotes a damping coefficient, e.g. 0.85; In(V_i) denotes the set of candidate sentences pointing to the i-th candidate sentence; Out(V_j) denotes the set of candidate sentences pointed to by the j-th candidate sentence; w_ji denotes the sentence similarity between the i-th candidate sentence and the j-th candidate sentence; and w_jk denotes the sentence similarity between the k-th candidate sentence and the j-th candidate sentence.
For example, with the TextRank algorithm, the text content may be split into whole sentences to obtain a plurality of candidate sentences. A sentence graph is then constructed with the candidate sentences as nodes and the sentence similarity between them as edges. The second initial weight of each candidate sentence is then calculated with formula 2 until the second preset convergence condition is reached, yielding the second target weight of each candidate sentence. Finally, the 3 candidate sentences with the largest second target weights form the third key sentence set of the target document.
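The sentence-level step can be sketched in the same spirit; here the sentence similarity used as the edge weight is a simple character-overlap measure, which is an assumption of this sketch rather than a requirement of the method.

    def sentence_similarity(s1, s2):
        """Illustrative edge weight: normalized character overlap between two candidate sentences."""
        c1, c2 = set(s1), set(s2)
        return len(c1 & c2) / (len(c1) + len(c2)) if c1 and c2 else 0.0

    def textrank_sentences(candidate_sentences, d=0.85, tol=1e-6, max_iter=100, top_m=3):
        """Build the sentence graph and iterate formula 2 to obtain the third key sentence set."""
        n = len(candidate_sentences)
        if n == 0:
            return []
        sim = [[sentence_similarity(a, b) if i != j else 0.0
                for j, b in enumerate(candidate_sentences)]
               for i, a in enumerate(candidate_sentences)]
        ws = [1.0] * n                                    # second initial weights
        for _ in range(max_iter):
            new_ws = []
            for i in range(n):
                # Formula 2: WS(Vi) = (1 - d) + d * sum over Vj of [w_ji / sum_k w_jk] * WS(Vj)
                rank_sum = 0.0
                for j in range(n):
                    total = sum(sim[j])
                    if sim[j][i] > 0 and total > 0:
                        rank_sum += sim[j][i] / total * ws[j]
                new_ws.append((1 - d) + d * rank_sum)
            converged = max(abs(a - b) for a, b in zip(new_ws, ws)) < tol
            ws = new_ws
            if converged:                                 # second preset convergence condition
                break
        top = sorted(range(n), key=lambda i: ws[i], reverse=True)[:top_m]
        return [candidate_sentences[i] for i in sorted(top)]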
In one or more optional embodiments of the present specification, the first key sentence set includes the third key sentence set and the fourth key sentence set, and the target key sentence set is determined based on the first key sentence set and the second key sentence set as follows: the union of the second, third and fourth key sentence sets may be taken to obtain the target key sentence set, that is, all the key sentences contained in the second, third and fourth key sentence sets are merged into one set to obtain the target key sentence set of the target document; alternatively, the intersection of the second, third and fourth key sentence sets may be taken to obtain the target key sentence set.
It should be noted that, when the intersection of the second, third and fourth key sentence sets is taken to obtain the target key sentence set, the key sentences contained in all three sets may form the target key sentence set of the target document; alternatively, the target confidence of each initial key sentence may be obtained from its initial confidences with respect to the second, third and fourth key sentence sets, and the target key sentence set then determined according to the target confidences. That is, the intersection of the second, third and fourth key sentence sets is taken to obtain the target key sentence set, and a specific implementation may be as follows:
determining a first initial confidence of the initial key sentence with respect to the second key sentence set, a second initial confidence with respect to the third key sentence set, and a third initial confidence with respect to the fourth key sentence set, where the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence coefficient of the initial key sentence according to the first initial confidence coefficient, the second initial confidence coefficient and the third initial confidence coefficient;
determining a target key sentence from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence;
and constructing a target key sentence set based on the target key sentences.
Specifically, confidence is also referred to as reliability, confidence level or confidence coefficient; the first initial confidence refers to the confidence of an initial key sentence with respect to the second key sentence set; the second initial confidence refers to the confidence of an initial key sentence with respect to the third key sentence set; the third initial confidence refers to the confidence of an initial key sentence with respect to the fourth key sentence set; the target confidence is the combined confidence obtained by processing the initial confidences; a target key sentence is the smallest constituent unit of the target key sentence set; and an initial key sentence is the smallest constituent unit of the second, third and fourth key sentence sets.
In practical applications, each initial key sentence has a corresponding initial confidence with respect to each of the second, third and fourth key sentence sets. When one of these key sentence sets does not contain a given initial key sentence, the initial confidence of that initial key sentence with respect to that set is a first preset value; for example, if the second key sentence set does not contain initial key sentence A, the first initial confidence of initial key sentence A is 0. When one of the key sentence sets contains a given initial key sentence, the initial confidence of that initial key sentence with respect to that set is a second preset value; for example, if the third key sentence set contains initial key sentence A, the second initial confidence of initial key sentence A is 1.
The first, second and third initial confidences of each initial key sentence with respect to the second, third and fourth key sentence sets are determined accordingly. The first initial confidence, the second initial confidence and the third initial confidence are then substituted into a preset calculation formula (shown as formula 3) to obtain the target confidence of the initial key sentence. The key sentences in the second, third and fourth key sentence sets whose target confidence is greater than a confidence threshold may be determined as target key sentences of the target document; alternatively, the key sentences in the second, third and fourth key sentence sets may be sorted in descending order of target confidence, and the top-L key sentences determined as target key sentences of the target document, where L is a preset positive integer. All the target key sentences obtained form the target key sentence set of the target document. In this way, the accuracy and efficiency of determining the target key sentences can be improved.
y = a1*x1 + a2*x2 + a3*x3    (formula 3)
In formula 3, y is the target confidence, x1, x2, and x3 are the first initial confidence, the second initial confidence, and the third initial confidence, respectively, and a1, a2, and a3 are weights corresponding to the first initial confidence, the second initial confidence, and the third initial confidence, respectively.
In addition, when one of the second, third and fourth key sentence sets contains a given initial key sentence, the initial confidence of that initial key sentence with respect to that set may also be the weight assigned to the initial key sentence when the set was determined. For example, when the third key sentence set is obtained with the TextRank algorithm, the weight of each key sentence in the third key sentence set is obtained and used as the initial confidence of that key sentence with respect to the third key sentence set. As another example, when the keywords are obtained with the TextRank algorithm, the weight of each keyword is obtained, and the fourth key sentence set is then constructed from the target text sentences containing the keywords; in this case, the weight of each target text sentence, i.e., key sentence, in the fourth key sentence set is the sum of the weights of the keywords it contains. As a further example, in the second key sentence set determined according to the semantic relevance between the first semantic features and each second semantic feature, the weight of a key sentence is the semantic relevance corresponding to that key sentence.
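A small sketch of the confidence fusion of formula 3, assuming 0/1 set membership as the initial confidences, equal weights a1 = a2 = a3 and a threshold of 0.5; all of these values are illustrative placeholders.

    def fuse_key_sentence_sets(second_set, third_set, fourth_set,
                               a1=1/3, a2=1/3, a3=1/3, threshold=0.5):
        """Score each initial key sentence with formula 3 and keep those above the threshold."""
        initial_key_sentences = set(second_set) | set(third_set) | set(fourth_set)
        scored = []
        for sentence in initial_key_sentences:
            x1 = 1.0 if sentence in second_set else 0.0    # first initial confidence
            x2 = 1.0 if sentence in third_set else 0.0     # second initial confidence
            x3 = 1.0 if sentence in fourth_set else 0.0    # third initial confidence
            y = a1 * x1 + a2 * x2 + a3 * x3                # formula 3: target confidence
            if y > threshold:
                scored.append((sentence, y))
        # Target key sentence set, ordered by descending target confidence
        scored.sort(key=lambda item: item[1], reverse=True)
        return [sentence for sentence, _ in scored]

With these placeholder values, a key sentence is kept only when it appears in at least two of the three sets, which corresponds to the soft-intersection behaviour described above.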
In one or more optional embodiments of the present specification, the key sentence extraction method may be used for document recall, that is, the target document may be a query document, or a candidate document, or both. In a case where the target document includes the query document and a plurality of candidate documents, after determining the target set of key sentences according to the first set of key sentences and the second set of key sentences, the method further includes:
and determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents.
Specifically, querying a document refers to a document that a user inputs for retrieval; the candidate documents refer to documents stored in a database; text similarity refers to the degree of similarity of the textual content of the query document to the candidate documents.
In practical application, based on the key sentence extraction method, a target key sentence set of a query document and a target key sentence set of each candidate document are obtained, and text similarity between the target key sentence set of the query document and the target key sentence sets of the candidate documents is calculated, that is, the text similarity between the query document and each candidate document is determined. Therefore, the text similarity between the documents is calculated based on the target key sentence set, and the accuracy and reliability of the obtained text similarity can be improved.
The target key sentence sets of the query document and of the candidate documents may be converted into feature vectors according to a preset vector conversion algorithm, and the similarity between the feature vector of the query document and the feature vector of each candidate document, that is, the text similarity between the query document and each candidate document, is then calculated with a preset similarity algorithm, such as a Euclidean distance algorithm, a Manhattan distance algorithm or a Minkowski distance algorithm.
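One possible realization of the vector conversion and distance calculation described above, using a character-level bag-of-words vector and the Euclidean distance as examples; the vectorization choice and the conversion of distance into similarity are assumptions of this sketch, not requirements of the application.

    import math
    from collections import Counter

    def to_feature_vector(key_sentence_set, vocabulary):
        """Character-level bag-of-words vector over a shared vocabulary (illustrative choice)."""
        counts = Counter(ch for sentence in key_sentence_set for ch in sentence)
        return [counts.get(term, 0) for term in vocabulary]

    def euclidean_distance(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    def manhattan_distance(u, v):
        return sum(abs(a - b) for a, b in zip(u, v))

    def text_similarity(query_set, candidate_set):
        """Convert both target key sentence sets to feature vectors and map distance to similarity."""
        vocabulary = sorted({ch for s in list(query_set) + list(candidate_set) for ch in s})
        q = to_feature_vector(query_set, vocabulary)
        c = to_feature_vector(candidate_set, vocabulary)
        return 1.0 / (1.0 + euclidean_distance(q, c))      # smaller distance, higher similarity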
In one or more optional embodiments of the present specification, after the text similarity between the query document and each candidate document is determined according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, similar documents of the query document may be recalled from the candidate documents according to the text similarities. For example, the similar documents of the query document may be determined from the candidate documents whose text similarity is greater than a text-similarity threshold; alternatively, the candidate documents may be sorted in descending order of text similarity, and the similar documents of the query document determined from the top-Q candidate documents, where Q is a preset positive integer. The similar documents are then fed back. Determining and recalling similar documents on the basis of the text similarity in this way improves both the efficiency and the accuracy of recalling similar documents.
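The recall step itself then reduces to thresholding and ranking, for example as follows (reusing the text_similarity helper sketched above; the threshold and top-Q values are placeholders):

    def recall_similar_documents(query_set, candidate_sets, similarity_threshold=0.6, top_q=10):
        """Return the candidate documents most similar to the query document."""
        scored = [(doc_id, text_similarity(query_set, candidate_set))
                  for doc_id, candidate_set in candidate_sets.items()]
        # Keep candidates above the text-similarity threshold, ranked from large to small
        recalled = [(doc_id, sim) for doc_id, sim in scored if sim > similarity_threshold]
        recalled.sort(key=lambda item: item[1], reverse=True)
        return recalled[:top_q]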
It should be noted that document recall returns, for a user's query document, similar documents with a high similarity to that query document. Since documents are relatively long, research on long-text similarity calculation is of great significance. Current methods for calculating long-text similarity mainly include the following. A character-level matching method: this method is rough and simple; the similarity is calculated only at the character level and the semantic level is ignored. A similarity determination method based on traditional machine learning: text feature vectors are constructed manually with methods such as the TF-IDF algorithm, the LSA algorithm and the LDA algorithm, and the text similarity is then obtained by computing cosine similarity, Euclidean distance and the like; this method requires manually constructed features and cannot make full use of the contextual semantic information of the text. A text similarity method based on deep learning with text truncation: because document text is long, the front or middle part of the document is usually truncated and taken as the text, and the text similarity is calculated with a Long Short-Term Memory (LSTM) network, a Convolutional Neural Network (CNN), a BERT (Bidirectional Encoder Representations from Transformers) model, and the like.
Compared with the character-based similarity determination method, the method provided in this specification determines the second key sentence set from the first semantic features of the keywords and the second semantic features of each text sentence in the target document, which effectively utilizes semantic information and greatly improves the accuracy of determining text similarity. Compared with the similarity determination method based on traditional machine learning, the method requires no manually constructed features; the second key sentence set is determined from the first semantic features of the keywords and the second semantic features of each text sentence, making full use of the contextual semantic information of the text. Compared with the text similarity method based on text truncation, the method extracts the key sentences of the document instead of truncating it, so key information of the text is not lost, which improves the accuracy of similar-document recall. The method thus avoids complex feature extraction, such as constructing feature vectors by computing TF-IDF, LDA or LSA, is convenient and fast, and effectively improves the accuracy of similar-document recall.
In one or more optional embodiments of the present specification, the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents may be further input to a similarity analysis model trained in advance, so as to obtain the text similarities between the query document and each candidate document. That is, before determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, the method further includes:
acquiring a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample statement set carrying a similarity label;
determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, comprising the following steps:
and inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document.
Specifically, the similarity analysis model refers to a pre-trained neural network model, such as a neural network model or a probabilistic neural network model, and also includes the BERT model, the Transformer model, the Sentence-BERT model, and the like; a sample sentence is a sentence used to train the similarity analysis model; a sample sentence pair is a set containing two sample sentences; the sample sentence set is a set containing a plurality of sample sentence pairs; and the similarity label refers to the true text similarity of the two sample sentences in a sample sentence pair.
In practical application, a similarity analysis model obtained by training based on a sample statement set carrying a similarity label can be obtained first. Then on the basis of acquiring a target key sentence set of the query document and target key sentence sets of a plurality of candidate documents, further inputting the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents into a similarity analysis model, carrying out similarity calculation on the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents by the similarity analysis model, and outputting text similarity of the target query document and each candidate document. Through a pre-trained similarity analysis model, text similarity between the query document and each candidate document is calculated based on a target key sentence set of the query document and target key sentence sets of a plurality of candidate documents, and the speed and accuracy for determining the text similarity can be improved.
In one or more alternative embodiments of the present description, the similarity analysis model includes a feature extraction layer and a pooling layer; at this time, the target key sentence set of the query document and the target key sentence sets of the plurality of candidate documents are input into the similarity analysis model to obtain the text similarity between the query document and each candidate document, and the specific implementation process can be as follows:
aiming at any candidate document, respectively inputting a target key sentence set of a query document and a target key sentence set of the candidate document to a feature extraction layer for feature extraction processing to obtain a query feature vector and a candidate feature vector;
inputting the query feature vector and the candidate feature vector into a pooling layer respectively for pooling processing to obtain a query embedded vector and a candidate embedded vector;
and determining the text similarity between the query document and the candidate document according to the query embedding vector and the candidate embedding vector.
Specifically, the feature extraction layer may be a neural network model, such as a BERT (Bidirectional Encoder Representations from Transformers) model. The pooling layer, also called a down-sampling layer, compresses the input query feature vector and candidate feature vector: on one hand, this reduces the number of features and parameters and simplifies the subsequent similarity calculation; on the other hand, it keeps a certain invariance of the query feature vector and the candidate feature vector. The query feature vector is the hidden-layer representation obtained by inputting the target key sentence set of the query document into the feature extraction layer for processing; the candidate feature vector is the hidden-layer representation obtained by inputting the target key sentence set of the candidate document into the feature extraction layer for processing. A hidden-layer representation abstracts the features of the input target key sentence set into another dimension space so as to express more abstract features of the set, and is also more amenable to linear separation. Pooling removes redundant information while retaining key information. The query embedding vector is the vector representation obtained by pooling the query feature vector, and the candidate embedding vector is the vector representation obtained by pooling the candidate feature vector.
In practical application, referring to fig. 3A, fig. 3A shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method provided in an embodiment of the present application, where the similarity analysis model includes a feature extraction layer and a pooling layer. On the basis of obtaining a target key sentence set of a query document and target key sentence sets of a plurality of candidate documents, the target key sentence set of the query document and the target key sentence set of any candidate document can be respectively input into a feature extraction layer, and after feature extraction processing is respectively carried out on the target key sentence set of the query document and the target key sentence set of the candidate documents by the feature extraction layer, a query feature vector and a candidate feature vector are output. Then, in order to reduce the data processing amount, the query feature vector and the candidate feature vector are respectively input into a pooling layer for pooling, and after pooling is completed, the pooling layer outputs the query embedded vector and the candidate embedded vector. Further, the query embedding vector and the candidate embedding vector are compared, and the similarity of the query embedding vector and the candidate embedding vector is calculated, namely the text similarity between the query document and the candidate document. Therefore, the efficiency and the accuracy of determining the text similarity can be improved.
In one or more optional embodiments of the present description, in order to improve efficiency and accuracy of feature extraction performed by the feature extraction layer, two sub-feature extraction layers having the same structure, parameter type, and parameter number may be set in the feature extraction layer, that is, the feature extraction layer includes a first sub-feature extraction layer and a second sub-feature extraction layer having the same structure, parameter type, and parameter number, so that one of the sub-feature extraction layers may perform feature extraction on a target key sentence set of a query document, and the other sub-feature extraction layer may perform feature extraction on a target key sentence set of a candidate document. That is, under the condition that the feature extraction layer includes the first sub-feature extraction layer and the second sub-feature extraction layer having the same structure, the same parameter type, and the same parameter number, the target key sentence set of the query document and the target key sentence set of the candidate document are respectively input to the feature extraction layer for feature extraction processing, so as to obtain a query feature vector and a candidate feature vector, and the specific implementation process may be as follows:
inputting a target key sentence set of a query document into a first sub-feature extraction layer for feature extraction processing to obtain a query feature vector;
inputting the target key sentence set of the candidate document into a second sub-feature extraction layer for feature extraction processing to obtain candidate feature vectors.
In practical application, referring to fig. 3A, fig. 3A shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method provided in an embodiment of the present application, where a feature extraction layer includes two sub-feature extraction layers: a first sub-feature extraction layer and a second sub-feature extraction layer. When feature extraction is performed on a target key sentence set of a query document and a target key sentence set of the candidate document, the target key sentence set of the query document needs to be input into a first sub-feature extraction layer, and after the first sub-feature extraction layer performs feature extraction on the target key sentence set of the query document, query feature vectors corresponding to the target key sentence set of the query document are output; and inputting the target key sentence set of the candidate document into a second sub-feature extraction layer, and outputting a candidate feature vector corresponding to the target key sentence set of the candidate document after the second sub-feature extraction layer performs feature extraction on the target key sentence set of the candidate document.
In order to improve the efficiency and the precision of the pooling processing of the pooling layer and further improve the efficiency of the similarity analysis model for determining the text similarity, two sub-pooling layers with the same structure, parameter type and parameter number can be arranged in the pooling layer, namely, the pooling layer comprises a first sub-pooling layer and a second sub-pooling layer with the same structure, parameter type and parameter number, so that one of the sub-pooling layers can perform the pooling processing on the query feature vector, and the other sub-pooling layer performs the pooling processing on the candidate feature vector. That is, under the condition that the pooling layer includes the first sub-pooling layer and the second sub-pooling layer having the same structure, the same parameter type and the same parameter number, the query feature vector and the candidate feature vector are respectively input to the pooling layer for pooling processing, so as to obtain a query embedding vector and a candidate embedding vector, and the specific implementation process may be as follows:
inputting the query feature vector into a first sub-pooling layer for pooling to obtain a query embedded vector;
and inputting the candidate feature vectors into a second sub-pooling layer for pooling to obtain candidate embedded vectors.
In practical application, referring to fig. 3A, fig. 3A shows a schematic structural diagram of a similarity analysis model in a key sentence extraction method provided in an embodiment of the present application, where the pooling layer includes two sub-pooling layers: a first sub-pooling layer and a second sub-pooling layer. When the query feature vector and the candidate feature vector are pooled, the query feature vector is input into the first sub-pooling layer, which pools it and outputs the query embedding vector corresponding to the query feature vector; and the candidate feature vector is input into the second sub-pooling layer, which pools it and outputs the candidate embedding vector corresponding to the candidate feature vector.
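The two-branch structure described above can be sketched, for illustration only, with BERT encoders from the transformers library; the checkpoint name, the weight sharing used to keep the two sub-feature extraction layers identical in structure and parameters, and the mean pooling used for the sub-pooling layers are assumptions of this sketch.

```python
# Minimal sketch of the similarity analysis model structure: two sub-feature
# extraction layers, two sub-pooling layers and a cosine similarity over the
# resulting embedding vectors (assumed encoder and pooling choices).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class SimilarityAnalysisModel(torch.nn.Module):
    def __init__(self, encoder_name="bert-base-chinese"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        # First and second sub-feature extraction layers; sharing weights is one
        # way to keep their structure, parameter types and parameter numbers identical.
        self.first_sub_encoder = AutoModel.from_pretrained(encoder_name)
        self.second_sub_encoder = self.first_sub_encoder

    def _pool(self, hidden_states, attention_mask):
        # Sub-pooling layer: masked mean pooling over the token feature vectors.
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def forward(self, query_key_sentences, candidate_key_sentences):
        q = self.tokenizer(" ".join(query_key_sentences), return_tensors="pt",
                           truncation=True, padding=True)
        c = self.tokenizer(" ".join(candidate_key_sentences), return_tensors="pt",
                           truncation=True, padding=True)
        # Query / candidate feature vectors (hidden-layer representations).
        q_features = self.first_sub_encoder(**q).last_hidden_state
        c_features = self.second_sub_encoder(**c).last_hidden_state
        # First / second sub-pooling layers produce the embedding vectors.
        q_embedding = self._pool(q_features, q["attention_mask"])
        c_embedding = self._pool(c_features, c["attention_mask"])
        # Text similarity between the query document and the candidate document.
        return F.cosine_similarity(q_embedding, c_embedding).item()
```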
Before the pre-trained similarity analysis model is obtained, the language characterization model needs to be trained to obtain the similarity analysis model. That is, before obtaining the pre-trained similarity analysis model, the method further includes:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample statement set pairs carrying similarity labels, and the sample statement set pairs comprise a first sample statement set and a second sample statement set;
extracting any sample statement set pair from the sample set, inputting a first sample statement set and a second sample statement set in the sample statement set pair to a language representation model, and obtaining the prediction similarity of the first sample statement set and the second sample statement set;
determining a loss value according to the prediction similarity and a similarity label carried by the sample statement set pair;
and adjusting model parameters of the language characterization model according to the loss value, continuously executing the step of extracting any sample statement set pair from the sample set, and determining the trained language characterization model as a similarity analysis model under the condition of reaching a first preset training stop condition.
Specifically, the language characterization model refers to a pre-specified pre-trained neural network model, such as RoBERTa model; the first sample statement set and the second sample statement set are two sample statement sets contained in the sample statement set pair; the predicted similarity refers to the similarity between a first sample statement set and a second sample statement set determined by the language representation model; the first training stop condition may be that the loss value is less than or equal to a preset threshold, or that the number of iterative training times reaches a preset iterative value, or that the loss value converges, i.e., the loss value does not decrease with continued training.
In practical applications, there are various ways to obtain the language representation model and the sample set, for example, the operator may send a training instruction of the language representation model to the execution subject, or send an obtaining instruction of the language representation model and the sample set, and accordingly, the execution subject starts to obtain the language representation model and the sample set after receiving the instruction; or the server may automatically acquire the language representation model and the sample set every preset time, for example, after the preset time, the server with the model training function automatically acquires the language representation model and the sample set in the specified access area; or after the preset time, the terminal with the model training function automatically acquires the language representation model and the sample set which are stored locally. The manner in which the language characterization model and the sample set are obtained is not limited in any way in this specification.
After the language representation model and the sample set are obtained, a sample statement set pair is extracted from the sample set, the first sample statement set and the second sample statement set contained in the pair are input into the language representation model, which performs similarity calculation on them and outputs the prediction similarity of the first sample statement set and the second sample statement set. Next, a loss value is determined according to the prediction similarity, the similarity label carried by the sample statement set pair and a preset first loss function. If the first preset training stop condition has not been reached, the model parameters of the language representation model are adjusted according to the loss value, another sample statement set pair is extracted from the sample set, and the next round of training is carried out; when the first preset training stop condition is reached, the trained language representation model is determined as the similarity analysis model. Training the language representation model with a plurality of sample statement set pairs in this way can improve the accuracy and speed with which the similarity analysis model determines the text similarity, and improves the robustness of the similarity analysis model.
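For illustration only, a minimal training sketch is given below, assuming the sentence-transformers training utilities, a RoBERTa checkpoint as the language representation model, a cosine-similarity loss as the first loss function, and a fixed number of epochs standing in for the first preset training stop condition.

```python
# Minimal sketch: train a similarity analysis model from sample statement set
# pairs that carry similarity labels (library, checkpoint and loss are assumptions).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def train_similarity_model(sample_set, base_model="hfl/chinese-roberta-wwm-ext",
                           epochs=3):
    # sample_set: iterable of (first_statement_set, second_statement_set, label),
    # where each statement set is a list of sentences and label is the real
    # similarity in [0, 1].
    model = SentenceTransformer(base_model)
    examples = [InputExample(texts=[" ".join(a), " ".join(b)], label=float(y))
                for a, b, y in sample_set]
    loader = DataLoader(examples, shuffle=True, batch_size=16)
    # A fixed epoch count approximates the first preset training stop condition;
    # a loss threshold or convergence check could be used instead.
    loss = losses.CosineSimilarityLoss(model)
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=100)
    return model
```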
In one or more alternative embodiments of the present specification, after the first semantic features of the keywords and the second semantic features of the text sentences in the target document are extracted, the second key sentence set may be determined according to the semantic association degree between the first semantic features and each second semantic feature. That is, the second key sentence set is determined according to the first semantic features and each second semantic feature, and the specific implementation process may be as follows:
determining semantic association degrees of the first semantic features and the second semantic features;
and determining a second key sentence set from each text sentence according to the semantic association degree.
Specifically, the semantic association degree refers to a similarity between the first semantic feature and the second semantic feature.
In practical applications, the similarity between the first semantic features and each second semantic feature, that is, the semantic association degree, may be calculated with a preset similarity algorithm, such as a Euclidean distance, Manhattan distance or Minkowski distance algorithm. Text sentences whose semantic association degree is greater than a semantic association degree threshold may then be added to the second key sentence set; alternatively, the text sentences may be sorted in descending order of semantic association degree, and the top P text sentences added to the second key sentence set, where P is a preset positive integer. In this way, the completeness of the second key sentence set can be improved, which further improves the efficiency of determining the target key sentence set.
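For illustration only, the following sketch shows one way to turn distances into semantic association degrees and to select the second key sentence set by threshold or by top-P ranking; the feature vectors, the 1/(1+d) mapping and the parameter names are assumptions of this sketch.

```python
# Minimal sketch: compute semantic association degrees from feature distances and
# select the second key sentence set (threshold or top-P).
import numpy as np

def select_second_key_sentences(keyword_feature, sentence_features, sentences,
                                p=5, threshold=None):
    # Euclidean distance is used here; Manhattan or Minkowski distance would only
    # change the norm order below.
    distances = np.linalg.norm(sentence_features - keyword_feature, axis=1)
    # Smaller distance means higher semantic association degree.
    association = 1.0 / (1.0 + distances)
    if threshold is not None:
        keep = [s for s, a in zip(sentences, association) if a > threshold]
    else:
        order = np.argsort(-association)[:p]
        keep = [sentences[i] for i in order]
    return keep
```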
In one or more optional embodiments of the present specification, the keywords and each text sentence in the target document may also be input to a pre-trained association degree analysis model, so as to obtain a semantic association degree between the first semantic feature and each second semantic feature. That is, before extracting the first semantic features of the keywords and the second semantic features of the text sentences in the target document, the method further includes:
acquiring a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction submodel and an association degree calculation submodel;
extracting first semantic features of the keywords and second semantic features of text sentences in the target document, wherein the extracting comprises the following steps:
inputting the keywords and each text sentence in the target document into a feature extraction submodel to obtain first semantic features of the keywords and second semantic features of each text sentence;
determining semantic association degree of the first semantic features and each second semantic feature, including:
and inputting the first semantic features and the second semantic features into the association degree calculation submodel to obtain the semantic association degree of the first semantic features and the second semantic features.
Specifically, the relevance analysis model is a pre-trained neural network model, such as a probabilistic neural network model, a BERT model, a Transformer model or a Sentence-BERT model; the feature extraction submodel is the part of the relevance analysis model that extracts features of the keywords or the text sentences; and the association degree calculation submodel is the part of the relevance analysis model that calculates the semantic association degree.
In practical application, the relevance analysis model including the feature extraction submodel and the association degree calculation submodel may be obtained first. Then, on the basis of obtaining the keywords and each text sentence in the target document, the keywords and each text sentence are further input into the feature extraction submodel, which performs feature extraction on them to obtain the first semantic features of the keywords and the second semantic features of each text sentence. The first semantic features and the second semantic features are then input into the association degree calculation submodel, which calculates and outputs the semantic association degree of the first semantic features and each second semantic feature, that is, the semantic association degree of the keywords and each text sentence. Obtaining the semantic association degree between the keywords and each text sentence in the target document through a pre-trained relevance analysis model can improve the speed and accuracy of determining the semantic association degree.
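For illustration only, the relevance analysis model can be sketched as follows, assuming a BERT feature extraction submodel and a cosine-based association degree calculation submodel; the checkpoint name and the mean pooling are assumptions of this sketch.

```python
# Minimal sketch: a feature extraction submodel (BERT encoder) plus an association
# degree calculation submodel (cosine similarity head), producing the semantic
# association degree between the keywords and each text sentence.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RelevanceAnalysisModel(torch.nn.Module):
    def __init__(self, encoder_name="bert-base-chinese"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.feature_extraction_submodel = AutoModel.from_pretrained(encoder_name)

    def _encode(self, texts):
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        hidden = self.feature_extraction_submodel(**batch).last_hidden_state
        mask = batch["attention_mask"].unsqueeze(-1).float()
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

    def forward(self, keyword, text_sentences):
        first_semantic_feature = self._encode([keyword])           # shape (1, d)
        second_semantic_features = self._encode(text_sentences)    # shape (n, d)
        # Association degree calculation submodel: cosine similarity per sentence.
        return F.cosine_similarity(
            first_semantic_feature.expand_as(second_semantic_features),
            second_semantic_features)
```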
Before obtaining the pre-trained relevance analysis model, the neural network model needs to be trained to obtain the relevance analysis model. That is, before the pre-trained relevance analysis model is obtained, the method further includes:
acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction submodel and an association degree calculation submodel, the training set comprises a plurality of sample pairs carrying association degree labels, and the sample pairs comprise sample words and sample sentences;
extracting any sample pair from the training set, inputting the sample words and sample sentences in the sample pair into the feature extraction submodel, and obtaining first prediction features of the sample words and second prediction features of the sample sentences;
inputting the first prediction features and the second prediction features into the association degree calculation submodel to obtain the prediction association degree of the first prediction features and the second prediction features;
determining a difference value according to the prediction association degree and the association degree label carried by the sample pair;
and adjusting model parameters of the feature extraction submodel and the association degree calculation submodel according to the difference value, continuously executing the step of extracting any sample pair from the training set, and determining the trained neural network model as the relevance analysis model under the condition that a second preset training stop condition is reached.
Specifically, the neural network model is a mathematical model built from neurons, such as a BERT model. A sample sentence is a sentence used for training to obtain the relevance analysis model, and a sample word is a word used for training to obtain the relevance analysis model; a sample pair refers to a set containing one sample word and one sample sentence; the training set refers to a set comprising a plurality of sample pairs; and the association degree label refers to the real association degree between the sample word and the sample sentence in a sample pair. The first prediction feature is the semantic feature of the sample word determined by the feature extraction submodel, the second prediction feature is the semantic feature of the sample sentence determined by the feature extraction submodel, and the prediction association degree is the association degree between the first prediction feature and the second prediction feature determined by the association degree calculation submodel. The second preset training stop condition may be that the difference value is less than or equal to a preset threshold, that the number of training iterations reaches a preset value, or that the difference value converges, i.e., the difference value does not decrease with continued training.
In practical applications, there are various ways to acquire the neural network model and the training set, for example, the operator may send a training instruction of the neural network model to the execution subject, or send an acquisition instruction of the neural network model and the training set, and accordingly, the execution subject starts to acquire the neural network model and the training set after receiving the instruction; or the server may automatically acquire the neural network model and the training set every preset time, for example, after the preset time, the server with the model training function automatically acquires the neural network model and the training set in the specified access area; or after the preset time, the terminal with the model training function automatically acquires the neural network model and the training set which are stored locally. The manner in which the neural network model and the training set are obtained is not limited in any way.
After the neural network model and the training set are obtained, a sample pair is extracted from the training set, and the sample word and the sample sentence contained in the sample pair are input into the feature extraction submodel, which performs feature extraction on them to obtain the first prediction feature of the sample word and the second prediction feature of the sample sentence. The first prediction feature and the second prediction feature are then input into the association degree calculation submodel, which calculates and outputs the prediction association degree of the first prediction feature and the second prediction feature. Next, a difference value is determined according to the prediction association degree, the association degree label carried by the sample pair and a preset second loss function. If the second preset training stop condition has not been reached, the model parameters of the neural network model are adjusted according to the difference value, another sample pair is extracted from the training set, and the next round of training is carried out; when the second preset training stop condition is reached, the trained neural network model is determined as the relevance analysis model. Training the neural network model with a plurality of sample pairs in this way can improve the accuracy and speed with which the relevance analysis model determines the semantic association degree, and improves the robustness of the relevance analysis model.
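For illustration only, a minimal training-loop sketch is given below, reusing the RelevanceAnalysisModel sketched earlier; the mean-squared-error loss standing in for the second loss function, the optimizer and the fixed epoch count standing in for the second preset training stop condition are assumptions of this sketch.

```python
# Minimal sketch: train the relevance analysis model on (sample word, sample
# sentence, association degree label) pairs; assumes the RelevanceAnalysisModel
# sketch defined above.
import torch

def train_relevance_model(model, training_set, epochs=3, lr=2e-5):
    # training_set: iterable of (sample_word, sample_sentence, association_label),
    # with association_label the real association degree in [0, 1].
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for sample_word, sample_sentence, label in training_set:
            predicted = model(sample_word, [sample_sentence])[0]
            # Difference value between predicted and labelled association degree.
            difference = loss_fn(predicted, torch.tensor(float(label)))
            optimizer.zero_grad()
            difference.backward()
            # Adjust the parameters of both submodels according to the difference value.
            optimizer.step()
    return model
```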
Referring to fig. 3B, fig. 3B shows a processing flow chart of determining text similarity in a key sentence extraction method provided in an embodiment of the present application, which is described by taking a query document and a candidate document as an example:
S1, first obtaining the keywords and the first key sentence sets of the query document and the candidate document, wherein each first key sentence set comprises a third key sentence set and a fourth key sentence set.
S1-1, obtaining the keywords and the third key sentence sets of the query document and the candidate document: extracting keywords and key sentences from the query document and the candidate document respectively by using the TextRank method, and generating the third key sentence set of the query document and the third key sentence set of the candidate document respectively;
S1-2, then obtaining the fourth key sentence sets of the query document and the candidate document: according to the keywords generated in step S1-1, searching the query document and the candidate document respectively for target text sentences containing the corresponding keywords, and determining the fourth key sentence set of the query document and the fourth key sentence set of the candidate document.
S2, inputting the keywords of the query document and each text sentence of the query document into the relevance analysis model, and taking a first preset number of text sentences with the highest semantic association degree with the keywords of the query document as the second key sentence set of the query document; and obtaining the second key sentence set of the candidate document in the same way.
S3, generating the target key sentence sets: taking the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set of the query document to obtain the target key sentence set of the query document; and obtaining the target key sentence set of the candidate document in the same way.
S4, determining text similarity: and inputting the target key sentence set of the query document and the target key sentence set of the candidate document into a pre-trained similarity analysis model to obtain the text similarity of the query document and the candidate document.
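For illustration only, steps S1 to S4 can be sketched end to end as follows; textrank_keywords, textrank_key_sentences, relevance_top_k and similarity_score are hypothetical callables standing for the TextRank extraction, the relevance analysis model and the similarity analysis model described above, and the naive sentence segmentation is likewise an assumption of this sketch.

```python
# Minimal sketch of steps S1-S4 for one query document and one candidate document.
import re

def split_sentences(text):
    # Naive sentence segmentation on common Chinese/Western sentence terminators.
    return [s.strip() for s in re.split(r"[。！？!?\.]", text) if s.strip()]

def target_key_sentence_set(document, textrank_keywords, textrank_key_sentences,
                            relevance_top_k):
    keywords = textrank_keywords(document)                    # S1-1: keywords
    third_set = set(textrank_key_sentences(document))         # S1-1: third key sentence set
    sentences = split_sentences(document)
    fourth_set = {s for s in sentences                        # S1-2: fourth key sentence set
                  if any(k in s for k in keywords)}
    second_set = set(relevance_top_k(keywords, sentences))    # S2: second key sentence set
    return second_set & third_set & fourth_set                # S3: target key sentence set

def text_similarity(query_doc, candidate_doc, extractors, similarity_score):
    query_keys = target_key_sentence_set(query_doc, *extractors)
    candidate_keys = target_key_sentence_set(candidate_doc, *extractors)
    return similarity_score(sorted(query_keys), sorted(candidate_keys))   # S4
```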
The key sentence extraction method provided by the application comprises: acquiring a target document, and extracting keywords and a first key sentence set based on the text content of the target document; extracting first semantic features of the keywords and second semantic features of the text sentences in the target document, and determining a second key sentence set according to the first semantic features and each second semantic feature; and determining a target key sentence set according to the first key sentence set and the second key sentence set. The first key sentence set is determined from the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information. The target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the key sentence extraction method provided by the application realizes automatic extraction of key sentences: while the accuracy of the key sentences is ensured, extracting key sentences no longer requires a large amount of manpower and material resources, which improves the efficiency of key sentence extraction and reduces its cost.
The following will further describe the key sentence extraction method with reference to fig. 4 by taking the application of the key sentence extraction method provided by the present application to document recall as an example. Fig. 4 shows a processing flow chart of a key sentence extraction method applied to document recall according to an embodiment of the present application, which specifically includes the following steps:
step 402: a query document and a plurality of candidate documents are obtained.
Step 404: and aiming at the query document and any one document in the candidate documents, extracting the key words and the third key sentence sets of the document by utilizing an extraction algorithm based on the text content according to the text content of the document.
Optionally, extracting the keyword of the document by using a text content-based extraction algorithm according to the text content of the document, including:
performing word segmentation and word stop removal processing on the text content of the document to obtain a plurality of candidate words;
constructing a word graph by taking each candidate word as a node and taking a co-occurrence relation among the candidate words as an edge according to a preset sliding window;
according to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached to obtain a first target weight corresponding to each candidate word;
determining a keyword for the document from each candidate word based on the first target weight.
Optionally, extracting a third set of key sentences of the document by using a text-based extraction algorithm according to the text content of the document, including:
sentence-dividing the text content of the document to obtain a plurality of candidate sentences;
constructing a sentence graph by taking the candidate sentences as nodes and the sentence similarity among the candidate sentences as edges;
according to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached to obtain a second target weight corresponding to each candidate sentence;
a third set of key sentences for the document is determined from the candidate sentences based on the second target weight.
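For illustration only, the word-graph and sentence-graph extraction of step 404 can be sketched as follows, with the iterative weight calculation delegated to the PageRank implementation in networkx; the window size, the overlap-based sentence similarity and the top-k cut-offs are assumptions of this sketch.

```python
# Minimal sketch: TextRank-style keyword and key sentence extraction. Candidate
# words are graph nodes linked by co-occurrence inside a sliding window; candidate
# sentences are nodes linked by pairwise similarity; PageRank provides the
# iterative weighting until convergence.
import networkx as nx

def textrank_keywords(words, stopwords=frozenset(), window=5, top_k=10):
    # words: the segmented text content, with stop words still present.
    candidates = [w for w in words if w not in stopwords]
    graph = nx.Graph()
    for i, w in enumerate(candidates):
        for other in candidates[i + 1:i + window]:        # co-occurrence edges
            if other != w:
                graph.add_edge(w, other)
    weights = nx.pagerank(graph)                          # first target weights
    return sorted(weights, key=weights.get, reverse=True)[:top_k]

def textrank_key_sentences(sentences, top_k=5):
    graph = nx.Graph()
    for i, a in enumerate(sentences):
        for j in range(i + 1, len(sentences)):
            overlap = len(set(a) & set(sentences[j]))     # crude sentence similarity
            if overlap:
                graph.add_edge(i, j, weight=float(overlap))
    weights = nx.pagerank(graph, weight="weight")         # second target weights
    ranked = sorted(weights, key=weights.get, reverse=True)[:top_k]
    return [sentences[i] for i in ranked]
```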
Step 406: and identifying a target text sentence containing the keyword in the document according to the keyword, and constructing a fourth key sentence set of the document based on the target text sentence.
Step 408: and acquiring a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction submodel and an association degree calculation submodel.
Optionally, before obtaining the pre-trained relevance analysis model, the method further includes:
acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction submodel and an association degree calculation submodel, the training set comprises a plurality of sample pairs carrying association degree labels, and the sample pairs comprise sample words and sample sentences;
extracting any sample pair from the training set, inputting the sample words and sample sentences in the sample pair into the feature extraction submodel, and obtaining first prediction features of the sample words and second prediction features of the sample sentences;
inputting the first prediction features and the second prediction features into the association degree calculation submodel to obtain the prediction association degree of the first prediction features and the second prediction features;
determining a difference value according to the prediction association degree and the association degree label carried by the sample pair;
and adjusting model parameters of the feature extraction submodel and the association degree calculation submodel according to the difference value, continuously executing the step of extracting any sample pair from the training set, and determining the trained neural network model as the relevance analysis model under the condition that a second preset training stop condition is reached.
Step 410: and inputting the keywords and each text sentence in the document into the feature extraction submodel to obtain first semantic features of the keywords and second semantic features of each text sentence.
Step 412: and inputting the first semantic features and the second semantic features into the association degree calculation submodel to obtain the semantic association degree of the first semantic features and the second semantic features.
Step 414: and determining a second key sentence set from each text sentence according to the semantic association degree.
Step 416: and solving the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain a target key sentence set of the document.
Optionally, the intersecting the second key sentence set, the third key sentence set, and the fourth key sentence set to obtain the target key sentence set of the document includes:
determining a first initial confidence degree of the initial key sentence relative to a second key sentence set, a second initial confidence degree relative to a third key sentence set and a third initial confidence degree relative to a fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence coefficient of the initial key sentence according to the first initial confidence coefficient, the second initial confidence coefficient and the third initial confidence coefficient;
determining a target key sentence from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence;
and constructing a target key sentence set of the document based on the target key sentences.
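For illustration only, the confidence-based intersection of step 416 can be sketched as follows; since the embodiment does not fix how the initial and target confidences are computed, the membership-based confidences, their averaging and the threshold requiring membership in all three sets are assumptions of this sketch.

```python
# Minimal sketch: intersect the second, third and fourth key sentence sets via
# per-sentence confidences (membership-based, averaged, thresholded).
def intersect_key_sentence_sets(second_set, third_set, fourth_set, threshold=1.0):
    all_candidates = set(second_set) | set(third_set) | set(fourth_set)
    target_key_sentences = []
    for sentence in all_candidates:
        # First, second and third initial confidences of this initial key sentence.
        confidences = [float(sentence in s) for s in (second_set, third_set, fourth_set)]
        target_confidence = sum(confidences) / 3.0
        if target_confidence >= threshold:
            target_key_sentences.append(sentence)
    return target_key_sentences
```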
Step 418: and acquiring a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample statement set carrying a similarity label.
Optionally, before obtaining the pre-trained similarity analysis model, the method further includes:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample statement set pairs carrying similarity labels, and the sample statement set pairs comprise a first sample statement set and a second sample statement set;
extracting any sample statement set pair from the sample set, inputting a first sample statement set and a second sample statement set in the sample statement set pair to a language representation model, and obtaining the prediction similarity of the first sample statement set and the second sample statement set;
determining a loss value according to the prediction similarity and a similarity label carried by the sample statement set pair;
and adjusting model parameters of the language characterization model according to the loss value, continuously executing the step of extracting any sample statement set pair from the sample set, and determining the trained language characterization model as a similarity analysis model under the condition of reaching a first preset training stop condition.
Step 420: and inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document.
Step 422: and recalling similar documents of the query document from the candidate documents according to the text similarity.
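For illustration only, the recall of step 422 can be sketched as follows; the similarity threshold and the top-n cut-off are assumptions of this sketch.

```python
# Minimal sketch: recall similar documents of the query document by ranking the
# candidate documents with the text similarities from step 420.
def recall_similar_documents(candidate_docs, similarities, threshold=0.8, top_n=10):
    ranked = sorted(zip(candidate_docs, similarities), key=lambda x: x[1], reverse=True)
    return [doc for doc, sim in ranked[:top_n] if sim >= threshold]
```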
According to the key sentence extraction method, the first key sentence set is determined according to the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information; the target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the key sentence extraction method provided by the application realizes automatic extraction of key sentences: while the accuracy of the key sentences is ensured, extracting key sentences no longer requires a large amount of manpower and material resources, which improves the efficiency of key sentence extraction and reduces its cost.
In addition, compared with a similarity determination method based on literal character matching, the key sentence extraction method provided by the application determines the second key sentence set through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, which makes effective use of semantic information and greatly improves the accuracy of determining text similarity; compared with a similarity determination method based on traditional machine learning, the method does not need manually constructed features, determines the second key sentence set through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, and makes full use of the contextual semantic information of the text; and compared with a deep-learning text similarity method based on text truncation, which truncates the text because it does not extract the key sentences of the document and therefore loses key text information, the method extracts the key sentences of the document and thus improves the accuracy of similar document recall.
Corresponding to the above method embodiment, the present application further provides an embodiment of a key sentence extraction device, and fig. 5 shows a schematic structural diagram of the key sentence extraction device provided in the embodiment of the present application. As shown in fig. 5, the apparatus includes:
a first obtaining module 502, configured to obtain a target document, and extract a keyword and a first set of key sentences based on the text content of the target document;
a first determining module 504, configured to extract a first semantic feature of the keyword and a second semantic feature of each text sentence in the target document, and determine a second key sentence set according to the first semantic feature and each second semantic feature;
a second determining module 506 configured to determine a set of target key sentences according to the first set of key sentences and the second set of key sentences.
Optionally, the first set of key sentences includes a third set of key sentences and a fourth set of key sentences;
a first obtaining module 502, further configured to:
extracting keywords and a third key sentence set of the target document by utilizing an extraction algorithm based on the text content according to the text content of the target document;
and identifying a target text sentence containing the keyword in the target document according to the keyword, and constructing a fourth key sentence set of the target document based on the target text sentence.
Optionally, the first obtaining module 502 is further configured to:
performing word segmentation and word stop removal processing on the text content of the target document to obtain a plurality of candidate words;
constructing a word graph by taking each candidate word as a node and taking a co-occurrence relation among the candidate words as an edge according to a preset sliding window;
according to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached to obtain a first target weight corresponding to each candidate word;
determining a keyword of the target document from each candidate word based on the first target weight.
Optionally, the first obtaining module 502 is further configured to:
performing sentence division processing on the text content of the target document to obtain a plurality of candidate sentences;
constructing a sentence graph by taking the candidate sentences as nodes and the sentence similarity among the candidate sentences as edges;
according to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached to obtain a second target weight corresponding to each candidate sentence;
a third set of key sentences of the target document is determined from the candidate sentences based on the second target weight.
Optionally, the second determining module 506 is further configured to:
and solving the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain a target key sentence set.
Optionally, the second determining module 506 is further configured to:
determining a first initial confidence degree of the initial key sentence relative to a second key sentence set, a second initial confidence degree relative to a third key sentence set and a third initial confidence degree relative to a fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence coefficient of the initial key sentence according to the first initial confidence coefficient, the second initial confidence coefficient and the third initial confidence coefficient;
determining a target key sentence from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence;
and constructing a target key sentence set based on the target key sentences.
Optionally, the target document comprises a query document and a plurality of candidate documents;
optionally, the apparatus further comprises a third determining module configured to:
and determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents.
Optionally, the apparatus further comprises a recall module configured to:
and recalling similar documents of the query document from the candidate documents according to the text similarity.
Optionally, the apparatus further comprises a second obtaining module configured to:
acquiring a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample statement set carrying a similarity label;
a third determination module further configured to:
and inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document.
Optionally, the apparatus further comprises a first training module configured to:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample statement set pairs carrying similarity labels, and the sample statement set pairs comprise a first sample statement set and a second sample statement set;
extracting any sample statement set pair from the sample set, inputting a first sample statement set and a second sample statement set in the sample statement set pair to a language representation model, and obtaining the prediction similarity of the first sample statement set and the second sample statement set;
determining a loss value according to the prediction similarity and a similarity label carried by the sample statement set pair;
and adjusting model parameters of the language characterization model according to the loss value, continuously executing the step of extracting any sample statement set pair from the sample set, and determining the trained language characterization model as a similarity analysis model under the condition of reaching a first preset training stop condition.
Optionally, the first determining module 504 is further configured to:
determining semantic association degrees of the first semantic features and the second semantic features;
and determining a second key sentence set from each text sentence according to the semantic association degree.
Optionally, the apparatus further comprises a third obtaining module configured to:
acquiring a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction submodel and an association degree calculation submodel;
a first determination module 504, further configured to:
inputting the keywords and each text sentence in the target document into a feature extraction submodel to obtain first semantic features of the keywords and second semantic features of each text sentence;
and inputting the first semantic features and the second semantic features into the association degree calculation submodel to obtain the semantic association degree of the first semantic features and the second semantic features.
Optionally, the apparatus further comprises a second training module configured to:
acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction submodel and an association degree calculation submodel, the training set comprises a plurality of sample pairs carrying association degree labels, and the sample pairs comprise sample words and sample sentences;
extracting any sample pair from the training set, inputting the sample words and sample sentences in the sample pair into the feature extraction submodel, and obtaining first prediction features of the sample words and second prediction features of the sample sentences;
inputting the first prediction features and the second prediction features into the association degree calculation submodel to obtain the prediction association degree of the first prediction features and the second prediction features;
determining a difference value according to the prediction association degree and the association degree label carried by the sample pair;
and adjusting model parameters of the feature extraction submodel and the association degree calculation submodel according to the difference value, continuously executing the step of extracting any sample pair from the training set, and determining the trained neural network model as the relevance analysis model under the condition that a second preset training stop condition is reached.
The key sentence extraction device determines the first key sentence set according to the text content of the target document, which ensures that the key sentences in the first key sentence set carry text-level information; the second key sentence set is determined through the first semantic features of the keywords and the second semantic features of each text sentence in the target document, so that key sentences can be determined more accurately at the semantic level, that is, the key sentences in the second key sentence set are ensured to carry semantic-level information; the target key sentence set is then determined according to the first key sentence set and the second key sentence set, so that the key sentences in the target key sentence set contain both text-level and semantic-level information, which improves the accuracy of determining the key sentences. In addition, the device realizes automatic extraction of key sentences: while the accuracy of the key sentences is ensured, extracting key sentences no longer requires a large amount of manpower and material resources, which improves the efficiency of key sentence extraction and reduces its cost.
The above is a schematic scheme of the key sentence extraction apparatus of this embodiment. It should be noted that the technical solution of the key sentence extraction apparatus and the technical solution of the key sentence extraction method belong to the same concept, and for details of the technical solution of the key sentence extraction apparatus that are not described in detail, reference can be made to the description of the technical solution of the key sentence extraction method. Further, the components in the device embodiment should be understood as the functional modules that must be established to implement the steps of the program flow or the steps of the method, rather than as components that are actually divided or defined separately in hardware. A device claim defined by such a set of functional modules should be understood as a functional module architecture for implementing the solution mainly by means of the computer program described in the specification, and not as a physical device that implements the solution mainly by means of hardware.
Fig. 6 illustrates a block diagram of a computing device 600 according to an embodiment of the present application. The components of the computing device 600 include, but are not limited to, a memory 610 and a processor 620. The processor 620 is coupled to the memory 610 via a bus 630 and a database 650 is used to store data.
Computing device 600 also includes an access device 640 that enables computing device 600 to communicate via one or more networks 660. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 640 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present application, the above-described components of computing device 600, as well as other components not shown in FIG. 6, may also be connected to each other, for example by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 6 is for purposes of example only and does not limit the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 600 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 600 may also be a mobile or stationary server.
Wherein the processor 620 is configured to execute the computer-executable instructions of the key sentence extraction method.
The above is an illustrative scheme of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned key sentence extraction method belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the above-mentioned key sentence extraction method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, are used for a key sentence extraction method.
The above is an illustrative scheme of a computer-readable storage medium of the present embodiment. It should be noted that the technical solution of the storage medium and the technical solution of the key sentence extraction method belong to the same concept, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the key sentence extraction method.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
An embodiment of the present application further provides a chip, in which a computer program is stored, and when the computer program is executed by the chip, the steps of the key sentence extraction method are implemented.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (16)

1. A method for extracting a key sentence, comprising:
acquiring a target document, and extracting a keyword and a first key sentence set based on the text content of the target document;
extracting first semantic features of the keywords and second semantic features of text sentences in the target document, and determining a second key sentence set according to the first semantic features and the second semantic features;
and determining a target key sentence set according to the first key sentence set and the second key sentence set.
2. The method of claim 1, wherein the first set of key sentences comprises a third set of key sentences and a fourth set of key sentences;
extracting keywords and a first key sentence set based on the text content of the target document, wherein the extracting includes:
extracting keywords and a third key sentence set of the target document by utilizing a character content-based extraction algorithm according to the character content of the target document;
and identifying a target text sentence containing the keyword in the target document according to the keyword, and constructing a fourth key sentence set of the target document based on the target text sentence.
3. The method according to claim 2, wherein extracting the keywords of the target document by using a text-based extraction algorithm according to the text content of the target document comprises:
performing word segmentation and word stop removal processing on the text content of the target document to obtain a plurality of candidate words;
constructing a word graph by taking each candidate word as a node and taking a co-occurrence relation between the candidate words as an edge according to a preset sliding window;
according to the word graph, iteratively calculating a first initial weight corresponding to each candidate word until a first preset convergence condition is reached to obtain a first target weight corresponding to each candidate word;
determining keywords of the target document from the candidate words based on the first target weight.
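For illustration, a minimal TextRank-style sketch of the word-graph procedure in claim 3, assuming the text has already been segmented and stop words removed; the window size, damping factor, convergence threshold and top_k are illustrative defaults, not values fixed by the claim:

```python
from collections import defaultdict

def extract_keywords(tokens, window=5, damping=0.85, max_iter=100, tol=1e-6, top_k=5):
    """TextRank-style keyword extraction over a co-occurrence word graph."""
    # Build the word graph: nodes are candidate words, edges connect words
    # that co-occur inside the preset sliding window.
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    if not graph:
        return []

    # Iteratively update the weight of each candidate word until the largest
    # change falls below the convergence threshold (the "first preset
    # convergence condition").
    weight = {w: 1.0 for w in graph}
    for _ in range(max_iter):
        new_weight = {
            w: (1 - damping) + damping * sum(weight[nb] / len(graph[nb]) for nb in graph[w])
            for w in graph
        }
        converged = max(abs(new_weight[w] - weight[w]) for w in graph) < tol
        weight = new_weight
        if converged:
            break

    # The converged weights are the "first target weights"; keep the top-k
    # candidates as the keywords of the target document.
    return [w for w, _ in sorted(weight.items(), key=lambda kv: -kv[1])[:top_k]]
```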
4. The method of claim 2, wherein extracting the third key sentence set of the target document by using a text-based extraction algorithm according to the text content of the target document comprises:
performing sentence division processing on the text content of the target document to obtain a plurality of candidate sentences;
constructing a sentence graph by taking the candidate sentences as nodes and the sentence similarity among the candidate sentences as edges;
according to the sentence graph, iteratively calculating a second initial weight corresponding to each candidate sentence until a second preset convergence condition is reached, to obtain a second target weight corresponding to each candidate sentence;
determining a third key sentence set of the target document from the candidate sentences based on the second target weight.
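A corresponding sketch for the sentence graph of claim 4; the overlap-based similarity measure and all parameter values are assumptions used only for illustration:

```python
import math

def rank_sentences(sentences, damping=0.85, max_iter=100, tol=1e-6, top_k=3):
    """TextRank-style ranking over a sentence similarity graph.

    `sentences` is assumed to be a list of token lists obtained by sentence
    division of the target document's text content.
    """
    def similarity(a, b):
        # A common choice for sentence-graph edges: token overlap normalised
        # by the log lengths of the two sentences.
        overlap = len(set(a) & set(b))
        if overlap == 0 or len(a) < 2 or len(b) < 2:
            return 0.0
        return overlap / (math.log(len(a)) + math.log(len(b)))

    n = len(sentences)
    if n == 0:
        return []
    sim = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]

    # Iterate the "second initial weights" until the change falls below the
    # convergence threshold (the "second preset convergence condition").
    weight = [1.0] * n
    for _ in range(max_iter):
        new_weight = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out_total = sum(sim[j])
                if sim[j][i] > 0.0 and out_total > 0.0:
                    rank += weight[j] * sim[j][i] / out_total
            new_weight.append((1 - damping) + damping * rank)
        converged = max(abs(a - b) for a, b in zip(new_weight, weight)) < tol
        weight = new_weight
        if converged:
            break

    # The converged weights are the "second target weights"; the top-k
    # sentences form the third key sentence set.
    top = sorted(range(n), key=lambda i: -weight[i])[:top_k]
    return [sentences[i] for i in top]
```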
5. The method of claim 2, wherein determining a target key sentence set according to the first key sentence set and the second key sentence set comprises:
and solving the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain a target key sentence set.
6. The method of claim 5, wherein solving the intersection of the second key sentence set, the third key sentence set and the fourth key sentence set to obtain a target key sentence set comprises:
determining a first initial confidence degree of an initial key sentence relative to the second key sentence set, a second initial confidence degree relative to the third key sentence set, and a third initial confidence degree relative to the fourth key sentence set, wherein the initial key sentence refers to any key sentence in the second key sentence set, the third key sentence set and the fourth key sentence set;
determining a target confidence degree of the initial key sentence according to the first initial confidence degree, the second initial confidence degree and the third initial confidence degree;
determining a target key sentence from the second key sentence set, the third key sentence set and the fourth key sentence set based on the target confidence degree;
and constructing the target key sentence set based on the target key sentences.
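One possible reading of the confidence-based combination in claim 6 is sketched below; treating each initial confidence as simple set membership and summing the three values into the target confidence is an assumption, since the claim leaves the concrete scoring open:

```python
def select_target_key_sentences(second_set, third_set, fourth_set, threshold=2.0):
    """Combine the three key sentence sets via per-sentence confidence scores."""
    second_set, third_set, fourth_set = set(second_set), set(third_set), set(fourth_set)
    target_set = set()
    for sentence in second_set | third_set | fourth_set:    # each "initial key sentence"
        c1 = 1.0 if sentence in second_set else 0.0          # first initial confidence degree
        c2 = 1.0 if sentence in third_set else 0.0           # second initial confidence degree
        c3 = 1.0 if sentence in fourth_set else 0.0          # third initial confidence degree
        if c1 + c2 + c3 >= threshold:                        # target confidence degree test
            target_set.add(sentence)
    return target_set
```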
7. The method of any of claims 1-6, wherein the target document comprises a query document and a plurality of candidate documents;
after determining a target key sentence set according to the first key sentence set and the second key sentence set, the method further includes:
and determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents.
8. The method of claim 7, after determining the text similarity of the query document to each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, further comprising:
and recalling similar documents of the query document from the candidate documents according to the text similarity.
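A minimal sketch of the recall step in claim 8, assuming the text similarities have already been computed as (candidate document, score) pairs; the top-k cut-off and the score threshold are illustrative parameters, not values required by the claim:

```python
def recall_similar_documents(similarities, top_k=10, min_score=0.5):
    """Recall the candidate documents whose text similarity to the query is highest."""
    ranked = sorted(similarities, key=lambda pair: pair[1], reverse=True)
    return [candidate for candidate, score in ranked[:top_k] if score >= min_score]
```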
9. The method according to claim 7, before determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, further comprising:
acquiring a pre-trained similarity analysis model, wherein the similarity analysis model is obtained by training based on a sample statement set pair carrying a similarity label;
determining the text similarity between the query document and each candidate document according to the target key sentence set of the query document and the target key sentence sets of the candidate documents, including:
and inputting the target key sentence set of the query document and the target key sentence sets of the candidate documents into the similarity analysis model to obtain the text similarity between the query document and each candidate document.
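For illustration, applying the pre-trained similarity analysis model of claim 9 could look like the sketch below, where similarity_model stands for any callable returning a similarity score for a pair of key sentence sets; the names are hypothetical:

```python
def score_candidates(similarity_model, query_key_sentences, candidate_key_sentence_sets):
    # Apply the pre-trained similarity analysis model to the query document's
    # target key sentence set and each candidate document's target key
    # sentence set, yielding one text similarity score per candidate.
    return [similarity_model(query_key_sentences, candidate_key_sentences)
            for candidate_key_sentences in candidate_key_sentence_sets]
```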
10. The method of claim 9, further comprising, prior to obtaining the pre-trained similarity analysis model:
acquiring a preset language characterization model and a sample set, wherein the sample set comprises a plurality of sample statement set pairs carrying similarity labels, and each sample statement set pair comprises a first sample statement set and a second sample statement set;
extracting any sample statement set pair from the sample set, inputting the first sample statement set and the second sample statement set in the sample statement set pair into the language characterization model, and obtaining a prediction similarity of the first sample statement set and the second sample statement set;
determining a loss value according to the prediction similarity and the similarity label carried by the sample statement set pair;
and adjusting model parameters of the language characterization model according to the loss value, continuing to execute the step of extracting any sample statement set pair from the sample set, and determining the trained language characterization model as the similarity analysis model when a first preset training stop condition is reached.
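A hedged sketch of the training loop in claim 10, written with PyTorch purely for concreteness; the mean-squared-error loss, the Adam optimiser and the fixed-epoch stop condition are assumptions, and language_model stands for any module that maps a tokenised sentence-set pair to a similarity score:

```python
import torch
from torch import nn

def train_similarity_model(language_model, sample_set, epochs=3, lr=1e-5):
    """Fine-tune a language characterization model into a similarity analysis model.

    `sample_set` is assumed to yield (first_set, second_set, label) triples
    whose label is already a tensor.
    """
    optimizer = torch.optim.Adam(language_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):                       # illustrative stop condition
        for first_set, second_set, label in sample_set:
            predicted = language_model(first_set, second_set)
            loss = loss_fn(predicted, label)      # loss from the similarity label

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return language_model                         # used as the similarity analysis model
```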
11. The method of claim 1, wherein determining a second key sentence set according to the first semantic features and each of the second semantic features comprises:
determining a semantic relevance degree of the first semantic features and each second semantic feature;
and determining the second key sentence set from the text sentences according to the semantic relevance degrees.
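As an illustration of claim 11, cosine similarity is one common choice for the semantic relevance degree; the sketch below ranks the text sentences by it and keeps the top-k, with the measure and top_k both being assumptions rather than requirements of the claim:

```python
import numpy as np

def select_second_key_sentences(keyword_feature, sentence_features, sentences, top_k=3):
    """Rank text sentences by their semantic relevance to the keyword feature."""
    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(np.dot(a, b) / denom) if denom else 0.0

    scores = [cosine(keyword_feature, features) for features in sentence_features]
    order = np.argsort(scores)[::-1][:top_k]     # highest relevance degrees first
    return [sentences[i] for i in order]
```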
12. The method of claim 11, further comprising, prior to extracting the first semantic features of the keywords and the second semantic features of the text sentences in the target document:
acquiring a pre-trained relevance analysis model, wherein the relevance analysis model comprises a feature extraction submodel and a relevance calculation submodel;
the extracting of the first semantic features of the keywords and the second semantic features of the text sentences in the target document includes:
inputting the keywords and each text sentence in the target document into the feature extraction submodel to obtain a first semantic feature of the keywords and a second semantic feature of each text sentence;
the determining the semantic relevance degree of the first semantic features and each second semantic feature includes:
and inputting the first semantic features and the second semantic features into the relevance calculation submodel to obtain the semantic relevance degree of the first semantic features and the second semantic features.
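At inference time, the two submodels of claim 12 could be composed as in the sketch below; both are assumed to be callables (for example torch.nn.Module instances), and all names are hypothetical:

```python
def compute_relevance_degrees(feature_extractor, relevance_head, keywords, text_sentences):
    # The feature extraction submodel produces the keyword feature and one
    # feature per text sentence; the relevance calculation submodel (here
    # `relevance_head`) turns each feature pair into a semantic relevance degree.
    keyword_feature = feature_extractor(keywords)
    return [relevance_head(keyword_feature, feature_extractor(sentence))
            for sentence in text_sentences]
```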
13. The method of claim 12, further comprising, before obtaining the pre-trained relevance analysis model:
acquiring a preset neural network model and a training set, wherein the neural network model comprises a feature extraction submodel and a relevance calculation submodel, the training set comprises a plurality of sample pairs carrying relevance degree labels, and each sample pair comprises a sample word and a sample sentence;
extracting any sample pair from the training set, inputting the sample word and the sample sentence in the sample pair into the feature extraction submodel, and obtaining a first prediction feature of the sample word and a second prediction feature of the sample sentence;
inputting the first prediction feature and the second prediction feature into the relevance calculation submodel to obtain a prediction relevance degree of the first prediction feature and the second prediction feature;
determining a difference value according to the prediction relevance degree and the relevance degree label carried by the sample pair;
and adjusting model parameters of the feature extraction submodel and the relevance calculation submodel according to the difference value, continuing to execute the step of extracting any sample pair from the training set, and determining the trained neural network model as the relevance analysis model when a second preset training stop condition is reached.
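Finally, a hedged PyTorch sketch of the training procedure in claim 13; jointly optimising both submodels with a mean-squared-error loss over the relevance degree labels is an assumption, as is the fixed-epoch stop condition:

```python
import torch
from torch import nn

def train_relevance_model(feature_extractor, relevance_head, training_set, epochs=3, lr=1e-5):
    """Jointly train the feature extraction and relevance calculation submodels.

    Both submodels are assumed to be torch.nn.Module instances, and the
    training set yields (sample_word, sample_sentence, label) triples with the
    relevance degree label as a tensor.
    """
    params = list(feature_extractor.parameters()) + list(relevance_head.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(epochs):                                     # illustrative stop condition
        for sample_word, sample_sentence, label in training_set:
            word_feature = feature_extractor(sample_word)           # first prediction feature
            sentence_feature = feature_extractor(sample_sentence)   # second prediction feature
            predicted = relevance_head(word_feature, sentence_feature)
            loss = loss_fn(predicted, label)                    # difference vs. the label

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    return feature_extractor, relevance_head
```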
14. A key sentence extraction device, comprising:
an acquisition module configured to acquire a target document and extract a keyword and a first key sentence set based on the text content of the target document;
a first determining module configured to extract a first semantic feature of the keyword and a second semantic feature of each text sentence in the target document, and determine a second key sentence set according to the first semantic feature and each second semantic feature;
and a second determining module configured to determine a target key sentence set according to the first key sentence set and the second key sentence set.
15. A computing device, comprising:
a memory and a processor;
wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the key sentence extraction method of any one of claims 1 to 13.
16. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the key sentence extraction method of any one of claims 1 to 13.
CN202210412327.4A 2022-04-19 2022-04-19 Key sentence extraction method and device Pending CN114818727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210412327.4A CN114818727A (en) 2022-04-19 2022-04-19 Key sentence extraction method and device

Publications (1)

Publication Number Publication Date
CN114818727A true CN114818727A (en) 2022-07-29

Family

ID=82506319

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210412327.4A Pending CN114818727A (en) 2022-04-19 2022-04-19 Key sentence extraction method and device

Country Status (1)

Country Link
CN (1) CN114818727A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115455950A (en) * 2022-09-27 2022-12-09 中科雨辰科技有限公司 Data processing system for acquiring text

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102779119A (en) * 2012-06-21 2012-11-14 盘古文化传播有限公司 Method and device for extracting keywords
CN109582949A (en) * 2018-09-14 2019-04-05 阿里巴巴集团控股有限公司 Event element abstracting method, calculates equipment and storage medium at device
CN111460099A (en) * 2020-03-30 2020-07-28 招商局金融科技有限公司 Keyword extraction method, device and storage medium
CN111460131A (en) * 2020-02-18 2020-07-28 平安科技(深圳)有限公司 Method, device and equipment for extracting official document abstract and computer readable storage medium
CN111680505A (en) * 2020-04-21 2020-09-18 华东师范大学 Markdown feature perception unsupervised keyword extraction method
CN112164391A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Statement processing method and device, electronic equipment and storage medium
CN112597776A (en) * 2021-03-08 2021-04-02 中译语通科技股份有限公司 Keyword extraction method and system
CN113590768A (en) * 2020-04-30 2021-11-02 北京金山数字娱乐科技有限公司 Training method and device of text relevance model and question-answering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination