CN113918708A - Abstract extraction method - Google Patents

Abstract extraction method

Info

Publication number
CN113918708A
Authority
CN
China
Prior art keywords
words
word
level
semantic
text
Prior art date
Legal status
Granted
Application number
CN202111532196.5A
Other languages
Chinese (zh)
Other versions
CN113918708B (en)
Inventor
胡为民
郑喜
Current Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Original Assignee
Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Dib Enterprise Risk Management Technology Co., Ltd.
Priority to CN202111532196.5A
Publication of CN113918708A
Application granted
Publication of CN113918708B
Active legal status (current)
Anticipated expiration

Classifications

    • G06F16/345 Information retrieval of unstructured textual data: summarisation for human users
    • G06F16/374 Creation of semantic tools: thesaurus
    • G06F18/2135 Pattern recognition, feature extraction based on approximation criteria, e.g. principal component analysis
    • G06F40/30 Handling natural language data: semantic analysis
    • G06N3/044 Neural network architectures: recurrent networks, e.g. Hopfield networks
    • G06N3/048 Neural network architectures: activation functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method comprising the following steps: S1, preprocessing: generalizing the numeric and time-type data in the bulletin text; S2, constructing a first vocabulary; S3, constructing word co-occurrence matrices for the first vocabulary; S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary; S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text; S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations; S7, receiving a key phrase input by the user and extracting its semantic representation; S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation, and, where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract. The extracted abstract correlates strongly with the keywords input by the user.

Description

Abstract extraction method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method.
Background
At present, the number of listed companies grows by the day, and their bulletins, which disclose temporary or annual financial and business conditions, contain a large amount of information. However, bulletin texts lack standard writing conventions and run long, which hinders reading and data analysis; data analysts and auditors must manually extract key sentences and other information from them, so working efficiency is low. It is therefore necessary to provide an abstract extraction method for listed-company bulletin texts that compresses the text, removes the 'redundant' information analysts and auditors do not care about, and improves their working efficiency.
Related abstract extraction methods exist, but they mainly search the full text for sentences containing the keywords, or words semantically similar to the keywords, and assemble the extracted sentences into an abstract, relying chiefly on word-vector similarity calculation. Applied to listed-company bulletin texts, these methods have shortcomings: they consider only the semantic association among keywords, not the association between the keywords and paragraphs or chapters, yet some keywords run through the entire bulletin text, so the extracted abstract is not accurate enough.
Disclosure of Invention
The invention provides an abstract extraction method to solve the problem that abstracts extracted by existing methods are not accurate enough; the abstract it extracts correlates strongly with the keywords input by the user.
An abstract extraction method comprises the following steps:
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text;
S2, constructing a first vocabulary;
S3, constructing word co-occurrence matrices for the first vocabulary;
S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary;
S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text;
S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations;
S7, receiving a key phrase input by the user and extracting its semantic representation;
S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation; where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract.
In this method, semantic representations of words are extracted, the similarity between each sentence-context semantic representation and the key-phrase semantic representation is judged, and the sentences whose similarity exceeds the set value are extracted to form the bulletin text abstract, so the abstract correlates strongly with the keywords input by the user.
Further, the numerals in the bulletin text Text are replaced with Chinese-character numerals, and the times in Text are replaced with Chinese-character times;
the marks among the punctuation, and the enumeration comma and the colon among the stops, are removed, and the remaining stops serve as delimiters to split the bulletin text into sentences; the bulletin text Text is segmented into Chinese words with the jieba segmenter, the stop words are removed, the words are weighted with TF-IDF, and the words are sorted by weight in descending order.
further, the step of constructing the first vocabulary in S2 includes obtaining words of 2000 words before the weight arrangement to construct the first vocabularyWords
Figure 37095DEST_PATH_IMAGE001
Whereinw i Is shown asiThe number of the words is one,w j is shown asjThe number of the words is one,nis the number of words
Figure 106682DEST_PATH_IMAGE002
Further, S3 comprises:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, establishing an association and constructing the word co-occurrence matrices
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j; each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
Further, S4 comprises reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations; and extracting the three-level semantic representations of all words in the first vocabulary.
Further, the dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
Further, in S5, at each repetition S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
further, the statement context three-level semantic representation is
Figure 58851DEST_PATH_IMAGE016
Wherein t is the t-th word in the sentence.
S7 comprises: the user inputs a key phrase, the three-level semantic representations of all keywords of the key phrase are extracted, and these are accumulated to form the key-phrase three-level semantic representation
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
Furthermore, a semantic similarity calculation model based on a twin neural network is constructed; it comprises two groups of isomorphic feedforward neural networks, takes as input a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and outputs their similarity;
a sentence-context three-level semantic representation is input together with the user's key-phrase three-level semantic representation, and when the similarity exceeds the set value, the sentence corresponding to the input sentence-context representation is extracted;
the above is repeated, inputting all sentence-context three-level semantic representations of the bulletin text in turn, until every sentence whose similarity with the user's key-phrase three-level semantic representation exceeds the set value has been extracted to form the bulletin text abstract.
Beneficial effects: by extracting semantic representations of words, judging the similarity between the sentence-context and key-phrase semantic representations, and extracting the sentences whose similarity exceeds the set value to form the bulletin text abstract, the method produces an abstract that correlates strongly with the keywords input by the user; it effectively removes the 'redundant' information the user does not care about and improves the user's working efficiency.
Drawings
The invention is described in further detail below with reference to the figures and specific embodiments.
Fig. 1 is a flowchart of the present embodiment.
Fig. 2 is an architecture diagram of the twin neural network of the present embodiment.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art. In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in fig. 1, this embodiment provides an abstract extraction method, illustrated on a listed-company bulletin text, comprising the following steps.
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text. This comprises:
replacing the numerals in the bulletin text Text with Chinese-character numerals, and replacing the times in Text with Chinese-character times;
removing the marks among the punctuation, and the enumeration comma and the colon among the stops, and splitting the bulletin text into sentences using the remaining stops as delimiters; segmenting the bulletin text Text into Chinese words with the jieba segmenter and removing the stop words to obtain the words of Text;
weighting the words with TF-IDF and sorting them by weight in descending order.
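For illustration only, this preprocessing and weighting could be sketched in Python as below; this is a minimal sketch in which jieba and scikit-learn stand in for tooling the patent does not name, the generalization regexes are simplified assumptions, and all helper names are hypothetical. The second helper also yields the 2000 highest-weighted words used to build the first vocabulary in S2.

```python
# Minimal sketch of S1 (and the vocabulary selection feeding S2); the
# generalization rules and helper names are illustrative assumptions.
import re
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text: str, stop_words: set) -> list:
    # Generalize time-type data first, then numeric data (placeholder tokens
    # here, rather than the Chinese-character replacement described above).
    text = re.sub(r"\d{4}年\d{1,2}月\d{1,2}日", "某年某月某日", text)
    text = re.sub(r"\d+(?:\.\d+)?", "某数", text)
    # Remove the enumeration comma and colon; split on the remaining stops.
    text = re.sub(r"[、：:]", "", text)
    sentences = [s for s in re.split(r"[。！？；!?;]", text) if s.strip()]
    # jieba Chinese word segmentation with stop-word removal.
    return [[w for w in jieba.lcut(s) if w.strip() and w not in stop_words]
            for s in sentences]

def top_weighted_words(token_sents: list, k: int = 2000) -> list:
    # TF-IDF weighting, then words sorted by weight in descending order.
    docs = [" ".join(s) for s in token_sents]
    vec = TfidfVectorizer(token_pattern=r"\S+")
    tfidf = vec.fit_transform(docs)
    weights = tfidf.sum(axis=0).A1          # aggregate weight per word
    vocab = vec.get_feature_names_out()
    return [vocab[i] for i in weights.argsort()[::-1][:k]]
```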
S2, constructing a first vocabulary. This comprises taking the 2000 highest-weighted words to build the first vocabulary Words:
Words = {w_1, w_2, ..., w_n}
where w_i denotes the i-th word, w_j denotes the j-th word, and n = 2000 is the number of words.
S3, constructing the word co-occurrence matrices of the first vocabulary:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, an association is established and the word co-occurrence matrices are constructed:
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j, and each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
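One way to realize this step is sketched below, under the assumption (consistent with the description above) that an element M[i, j] is estimated as the fraction of units in which both words appear; calling the function with sentence, paragraph, or chapter token lists yields the matrix of the corresponding level. The function name and the frequency-based probability estimate are assumptions.

```python
# Illustrative construction of one level's word co-occurrence matrix; the
# joint probability is estimated as the fraction of units containing both words.
import numpy as np

def cooccurrence_matrix(units: list, vocab: list) -> np.ndarray:
    """units: token lists, one per sentence, paragraph, or chapter."""
    index = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(vocab), len(vocab)))
    for unit in units:
        present = {index[w] for w in unit if w in index}
        for i in present:
            for j in present:
                m[i, j] += 1.0          # w_i and w_j co-occur in this unit
    return m / max(len(units), 1)       # estimate of p(w_i, w_j)
```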
S4, reducing the dimensionality of the word co-occurrence matrices and extracting the semantic representations of all words in the first vocabulary. This comprises:
reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations. The dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
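The dimension reduction might be sketched as below; the row standardization and the projection onto the covariance matrix's first 100 eigenvectors follow the variable descriptions above, but since the original formula is reconstructed, the exact normalization is an assumption.

```python
# Sketch of S4: standardize each word's co-occurrence row, then project onto
# the covariance matrix's first 100 eigenvectors (feature column vectors).
import numpy as np

def semantic_vectors(M: np.ndarray, dim: int = 100) -> np.ndarray:
    mu = M.mean(axis=1, keepdims=True)
    sigma = M.std(axis=1, keepdims=True) + 1e-12     # k-th row standard deviation
    Z = (M - mu) / sigma                             # standardized rows z_k
    C = np.cov(Z, rowvar=False)                      # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(C)
    U = eigvecs[:, np.argsort(eigvals)[::-1][:dim]]  # U_100: top-100 eigenvectors
    return Z @ U                                     # 2000 x 100 matrix of v_k
```

Applying the function to the sentence-level, paragraph-level, and chapter-level matrices gives each word three 100-dimensional vectors, i.e. its three-level semantic representation.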
S2 to S4 constitute a three-level semantic coding method, by which the three-level semantic representations of all words in the bulletin text are extracted. The text is then split sentence by sentence, and the three-level semantic representations of the words of a sentence are accumulated to form the sentence-context three-level semantic representation.
S5, repeating S2 to S4 to extract the semantic representations of all words in the bulletin text: at each repetition, S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
S6, accumulating the word representations sentence by sentence to form the sentence-context semantic representations. The sentence-context three-level semantic representation is:
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is that word's three-level semantic representation.
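A minimal sketch of this accumulation follows (S7 below reuses it for the key phrase); the `vectors` map from words to their three-level semantic vectors is assumed to come from the dimension-reduction step, and the function name is hypothetical.

```python
# Sketch of S6/S7: sum the three-level semantic vectors of the unit's words.
import numpy as np

def context_vector(tokens: list, vectors: dict) -> np.ndarray:
    dim = len(next(iter(vectors.values())))
    v = np.zeros(dim)
    for t in tokens:          # t-th word of the sentence or key phrase
        if t in vectors:
            v += vectors[t]   # accumulate v_t
    return v                  # V_sentence or V_keywords
```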
S7, the user inputs a key phrase and its semantic representation is extracted. This comprises:
extracting the three-level semantic representations of all keywords of the key phrase input by the user, and accumulating them to form the key-phrase three-level semantic representation:
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
S8, judging the similarity between the sentence-context semantic representations and the key-phrase semantic representation, and extracting the sentences whose similarity exceeds a set value to form the bulletin text abstract. This comprises:
constructing a semantic similarity calculation model based on a twin neural network:
Similarity(Text, keywords) = f(V_sentence, V_keywords)
As shown in fig. 2, the semantic similarity calculation model based on the twin neural network comprises two groups of isomorphic feedforward neural networks; its inputs are a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and its output is the similarity. Specifically, the model consists of two independent parallel input layers, two independent parallel hidden layers, and one output layer; the input layer dimension is 1 x 100 and the hidden layer dimension is 1 x 10; each input layer is connected to its hidden layer through a Sigmoid activation function, and the two hidden layers are jointly connected to the output layer through a Sigmoid activation function; the output layer is trained with a cross-entropy loss function and outputs the similarity.
The sentence-context three-level semantic representations and the user's key-phrase three-level semantic representation are used as inputs to train the model, and the trained model calculates the similarity Similarity(Text, keywords) between a sentence-context representation and the key-phrase representation. Specifically, the sentence-context three-level semantic representation feeds one of the two independent parallel input layers, and the key-phrase three-level semantic representation feeds the other.
The similarity between the key-phrase and sentence-context semantic representations is then judged against the set value of 0.7: when
Similarity(Text, keywords) > 0.7
the bulletin text sentence containing the key phrase is extracted into the bulletin text abstract.
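A PyTorch sketch of the twin network as described (two independent parallel 1 x 100 input layers, two 1 x 10 hidden layers, Sigmoid activations, one shared output unit, cross-entropy training) might look as follows; the optimizer, the training pairs, and the concatenation of the two hidden layers before the output layer are assumptions not fixed by the text.

```python
# Illustrative twin-network similarity model; layer sizes follow the text,
# the remaining design choices are assumed.
import torch
import torch.nn as nn

class TwinSimilarity(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden_text = nn.Linear(100, 10)   # branch for the sentence context
        self.hidden_keys = nn.Linear(100, 10)   # branch for the key phrase
        self.out = nn.Linear(20, 1)             # shared output layer

    def forward(self, v_text: torch.Tensor, v_keys: torch.Tensor) -> torch.Tensor:
        h_text = torch.sigmoid(self.hidden_text(v_text))   # Sigmoid activation
        h_keys = torch.sigmoid(self.hidden_keys(v_keys))
        joint = torch.cat([h_text, h_keys], dim=-1)
        return torch.sigmoid(self.out(joint))              # similarity in (0, 1)

model = TwinSimilarity()
loss_fn = nn.BCELoss()   # binary cross-entropy over similar/dissimilar pairs

# Extraction rule with the set value of 0.7:
# keep a sentence if model(v_sentence, v_keywords).item() > 0.7
```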
S6 to S8 constitute an abstract extraction method based on context semantic similarity calculation, which extracts the sentences carrying key information in the bulletin text to form the abstract.
The abstract extraction method provided by this embodiment extracts sentence-level, paragraph-level, and chapter-level semantic representations of words through the three-level semantic coding method, judges the similarity between the sentence-context and key-phrase semantic representations through the context-semantic-similarity abstract extraction method, and extracts the sentences whose similarity exceeds the set value to form the bulletin text abstract.
The method thus considers the association of the keywords input by the user with sentences, paragraphs, and chapters, and accurately extracts the sentences strongly associated with those keywords to form the abstract text; it effectively removes the 'redundant' information the user does not care about and improves the user's working efficiency.
It should be understood that the above examples are given only for clarity of illustration and do not limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here, and obvious variations or modifications derived therefrom remain within the protection scope of the invention.

Claims (10)

1. An abstract extraction method, characterized by comprising the following steps:
S1, preprocessing: generalizing the numeric and time-type data in the bulletin text;
S2, constructing a first vocabulary;
S3, constructing word co-occurrence matrices for the first vocabulary;
S4, reducing the dimensionality of the word co-occurrence matrices and extracting semantic representations of all words in the first vocabulary;
S5, repeating S2 to S4 to extract semantic representations of all words in the bulletin text;
S6, accumulating the word representations sentence by sentence to form sentence-context semantic representations;
S7, receiving a key phrase input by the user and extracting its semantic representation;
S8, computing the similarity between the key-phrase semantic representation and each sentence-context semantic representation; where the similarity exceeds a set value, extracting the bulletin text sentence containing the key phrase into the bulletin text abstract.
2. The abstract extraction method according to claim 1, wherein S1 comprises:
replacing the numerals in the bulletin text Text with Chinese-character numerals, and replacing the times in Text with Chinese-character times;
removing the marks among the punctuation, and the enumeration comma and the colon among the stops, and splitting the bulletin text into sentences using the remaining stops as delimiters; segmenting the bulletin text Text into Chinese words with the jieba segmenter, removing the stop words, weighting the words with TF-IDF, and sorting the words by weight in descending order.
3. The abstract extraction method according to claim 2, wherein constructing the first vocabulary in S2 comprises taking the 2000 highest-weighted words to build the first vocabulary Words:
Words = {w_1, w_2, ..., w_n}
where w_i denotes the i-th word, w_j denotes the j-th word, and n = 2000 is the number of words.
4. The abstract extraction method according to claim 3, wherein S3 comprises:
for any two words w_i and w_j appearing in the same sentence, the same paragraph, or the same chapter, establishing an association and constructing the word co-occurrence matrices
M = {M_sent, M_para, M_chap}
where M_sent is the sentence-level word co-occurrence matrix, M_para is the paragraph-level word co-occurrence matrix, and M_chap is the chapter-level word co-occurrence matrix; the matrix row index i and column index j are the indices of the two co-occurring words w_i and w_j, and each matrix element is the joint probability of the two words addressed by its row and column indices:
M[i, j] = p(w_i, w_j)
5. The abstract extraction method according to claim 4, wherein S4 comprises reducing the dimensionality of the sentence-level, paragraph-level, and chapter-level word co-occurrence matrices with principal component analysis, the reduced dimension being 2000 x 100, where 2000 is the number of words and 100 is the dimension of each word semantic vector; after dimension reduction, the vectors obtained from the three word co-occurrence matrices form each word's three-level semantic representation, i.e. its sentence-level, paragraph-level, and chapter-level semantic representations; and extracting the three-level semantic representations of all words in the first vocabulary.
6. The abstract extraction method according to claim 5, wherein the dimension reduction is calculated as:
v_k = z_k · U_100, with z_k = (M_k - mean(M_k)) / σ_k and C = cov(Z)
where σ_k is the standard deviation of the k-th row vector; M_k is the k-th row vector of the word co-occurrence matrix M; Z is the matrix of standardized rows z_k; C is the covariance matrix of Z; U_100 is the matrix of the first 100 feature column vectors (eigenvectors) of C; and v_k is the three-level semantic representation of the k-th word in the word co-occurrence matrix.
7. The abstract extraction method according to claim 6, wherein in S5, at each repetition S2 constructs a new vocabulary, taking the next 2000 highest-weighted words in turn, until all words in the bulletin text are covered.
8. The abstract extraction method according to claim 7, wherein the sentence-context three-level semantic representation is
V_sentence = Σ_t v_t
where t indexes the t-th word in the sentence and v_t is that word's three-level semantic representation.
9. The abstract extraction method according to claim 8, wherein S7 comprises: receiving a key phrase input by the user, extracting the three-level semantic representations of all keywords of the key phrase, and accumulating them to form the key-phrase three-level semantic representation
V_keywords = Σ_t v_t
where t indexes the t-th word in the key phrase.
10. The abstract extraction method according to claim 9, wherein S8 comprises:
constructing a semantic similarity calculation model based on a twin neural network, the model comprising two groups of isomorphic feedforward neural networks, taking as input a sentence-context three-level semantic representation and the user's key-phrase three-level semantic representation, and outputting their similarity;
inputting a sentence-context three-level semantic representation together with the user's key-phrase three-level semantic representation, and extracting the sentence corresponding to the input sentence-context representation when the similarity exceeds the set value;
and repeating the above, inputting all sentence-context three-level semantic representations of the bulletin text in turn, until every sentence whose similarity with the user's key-phrase three-level semantic representation exceeds the set value has been extracted to form the bulletin text abstract.
CN202111532196.5A 2021-12-15 2021-12-15 Abstract extraction method Active CN113918708B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111532196.5A CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Publications (2)

Publication Number Publication Date
CN113918708A true CN113918708A (en) 2022-01-11
CN113918708B CN113918708B (en) 2022-03-22

Family

ID=79248937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111532196.5A Active CN113918708B (en) 2021-12-15 2021-12-15 Abstract extraction method

Country Status (1)

Country Link
CN (1) CN113918708B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102646114A (en) * 2012-02-17 2012-08-22 清华大学 News topic timeline abstract generating method based on breakthrough point
CN104679730A (en) * 2015-02-13 2015-06-03 刘秀磊 Webpage summarization extraction method and device thereof
CN110069622A (en) * 2017-08-01 2019-07-30 武汉楚鼎信息技术有限公司 A kind of personal share bulletin abstract intelligent extract method
CN110188349A (en) * 2019-05-21 2019-08-30 清华大学深圳研究生院 A kind of automation writing method based on extraction-type multiple file summarization method
CN110851598A (en) * 2019-10-30 2020-02-28 深圳价值在线信息科技股份有限公司 Text classification method and device, terminal equipment and storage medium
CN111259136A (en) * 2020-01-09 2020-06-09 信阳师范学院 Method for automatically generating theme evaluation abstract based on user preference
WO2021164231A1 (en) * 2020-02-18 2021-08-26 平安科技(深圳)有限公司 Official document abstract extraction method and apparatus, and device and computer readable storage medium
US20210342552A1 (en) * 2020-05-01 2021-11-04 International Business Machines Corporation Natural language text generation from a set of keywords using machine learning and templates

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李峰 et al.: "Research on a Domain-Corpus-Driven Sentence Relevance Calculation Method", 《计算机科学》 (Computer Science) *
黄亚明 et al.: "A Development Attempt at an SKR/MetaMap Output Concept Co-occurrence Analysis *** for Semantic Mining of Web Texts", 《现代图书情报技术》 (New Technology of Library and Information Service) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12008332B1 (en) 2023-08-18 2024-06-11 Anzer, Inc. Systems for controllable summarization of content

Also Published As

Publication number Publication date
CN113918708B (en) 2022-03-22

Similar Documents

Publication Publication Date Title
Daud et al. Urdu language processing: a survey
Oh et al. Why-question answering using intra-and inter-sentential causal relations
CN113704451B (en) Power user appeal screening method and system, electronic device and storage medium
US20080027893A1 (en) Reference resolution for text enrichment and normalization in mining mixed data
Murthy et al. Language identification from small text samples
CN108319583B (en) Method and system for extracting knowledge from Chinese language material library
Petersen et al. Natural Language Processing Tools for Reading Level Assessment and Text Simplication for Bilingual Education
Golpar-Rabooki et al. Feature extraction in opinion mining through Persian reviews
CN113918708B (en) Abstract extraction method
Yan et al. Chemical name extraction based on automatic training data generation and rich feature set
Melero et al. Holaaa!! writin like u talk is kewl but kinda hard 4 NLP
Saleh et al. TxLASM: A novel language agnostic summarization model for text documents
JP6168057B2 (en) Failure occurrence cause extraction device, failure occurrence cause extraction method, and failure occurrence cause extraction program
CN115952794A (en) Chinese-Tai cross-language sensitive information recognition method fusing bilingual sensitive dictionary and heterogeneous graph
Liu et al. Keyword extraction using PageRank on synonym networks
Ali et al. Word embedding based new corpus for low-resourced language: Sindhi
Cui Converting taxonomic descriptions to new digital formats
Das et al. An improvement of Bengali factoid question answering system using unsupervised statistical methods
Hamza et al. Text mining: A survey of Arabic root extraction algorithms
Saneifar et al. From terminology extraction to terminology validation: an approach adapted to log files
Worke INFORMATION EXTRACTION MODEL FROM GE’EZ TEXTS
Temesgen Afaan Oromo News Text Summarization Using Sentence Scoring Method
Modrzejewski Improvement of the Translation of Named Entities in Neural Machine Translation
Saleh et al. TxLASM: A Novel Language Agnostic Summarization Model for Text Documents
Dias Information digestion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20220111

Assignee: Shenzhen Mingji Agricultural Development Co.,Ltd.

Assignor: SHENZHEN DIB ENTERPRISE RISK MANAGEMENT TECHNOLOGY CO.,LTD.

Contract record no.: X2023980049635

Denomination of invention: A Summary Extraction Method

Granted publication date: 20220322

License type: Common License

Record date: 20231204