CN115577082A - Document keyword extraction method and device, electronic equipment and storage medium - Google Patents

Document keyword extraction method and device, electronic equipment and storage medium

Info

Publication number
CN115577082A
Authority
CN
China
Prior art keywords
document
text
structured
candidate
candidate keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211216904.9A
Other languages
Chinese (zh)
Inventor
林荣荣
张小晶
梁志明
支天波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Xiaoai Robot Technology Co ltd
Original Assignee
Guizhou Xiaoai Robot Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Xiaoai Robot Technology Co ltd filed Critical Guizhou Xiaoai Robot Technology Co ltd
Priority to CN202211216904.9A priority Critical patent/CN115577082A/en
Publication of CN115577082A publication Critical patent/CN115577082A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for extracting document keywords, electronic equipment and a storage medium. The method comprises the following steps: converting the unstructured long document to be processed into a structured long document; according to the hash fingerprint of each text in the structured long document, carrying out text deduplication processing on the structured long document to obtain a target structured document; identifying each candidate keyword in the target structured document, and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document; and calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords. By executing the technical scheme, the extraction of the keywords in the unstructured long document can be realized, and the accuracy of extracting the keywords in the unstructured long document is improved.

Description

Document keyword extraction method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of automatic identification, in particular to a method and a device for extracting document keywords, electronic equipment and a storage medium.
Background
Keywords summarize the main content of a document and are an important way to quickly grasp its topic. Keywords appear in many places: news websites label each news item with tags, and scientific papers list the keywords they discuss. Keywords reduce the difficulty of finding information in massive amounts of data and are now widely applied in various fields. Accurately and quickly identifying the keywords of a document allows a user to obtain the document's useful information more quickly, locate a target file, and so on.
In the prior art, information in an unstructured long document is identified mainly by deep-learning-based methods that build a deep learning model from a bidirectional long short-term memory network and a conditional random field; such methods have achieved a certain application effect.
In the process of implementing the invention, the inventors found that the above method has the following defect: information in an unstructured long document is diverse and large in volume, so the method is only suitable for smaller documents, that is, documents containing a small amount of information; for larger documents, that is, documents containing a large amount of information, the identification precision and reliability of the prior art drop significantly.
Disclosure of Invention
The invention provides a method and a device for extracting document keywords, electronic equipment and a storage medium, which aim to solve the problem that the identification precision of a large unstructured long document is reduced in the prior art.
In a first aspect, an embodiment of the present invention provides a method for extracting a document keyword, where the method includes:
converting the unstructured long document to be processed into a structured long document;
according to the Hash fingerprint of each text in the structured long document, carrying out text deduplication processing on the structured long document to obtain a target structured document, wherein each text comprises at least one sentence;
identifying each candidate keyword in the target structured document, and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document;
and calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
In a second aspect, an embodiment of the present invention provides an apparatus for extracting a document keyword, where the apparatus includes:
the document structuralization conversion module is used for converting the unstructured long document to be processed into a structured long document;
the document deduplication module is used for performing text deduplication processing on the long structured document according to the hash fingerprint of each text in the long structured document to obtain a target structured document, wherein each text comprises at least one sentence;
the weight value calculation module is used for identifying each candidate keyword in the target structured document and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document;
and the document keyword screening module is used for calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to execute the method for extracting the document keywords according to any embodiment of the present invention.
According to another aspect of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores computer instructions which, when executed, cause a processor to implement the method for extracting document keywords according to any embodiment of the present invention.
According to the technical scheme of the embodiment of the invention, the unstructured long document is converted into a structured long document; the texts are deduplicated according to the hash fingerprint of each text; each candidate keyword in the deduplicated document is identified; the weight of each candidate keyword is determined from its word frequency-inverse text frequency index and its information entropy; and the document keywords are obtained by screening according to the weights and the calculation rule of a preset text sorting algorithm. The accuracy of extracting keywords from the unstructured long document is thereby improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.
FIG. 1a is a flowchart of a method for extracting document keywords according to the first embodiment of the present invention;
FIG. 1b is a flowchart of a method for obtaining a hash fingerprint according to the first embodiment of the present invention;
FIG. 2a is a flowchart of a method for extracting document keywords according to the second embodiment of the present invention;
FIG. 2b is a diagram illustrating the conversion efficiency of the method provided by the second embodiment of the present invention for unstructured long documents of different sizes;
FIG. 2c is a schematic diagram illustrating the fluctuation of the deduplication ratio under different Hamming distances according to the second embodiment of the present invention;
FIG. 2d is a schematic diagram of the fluctuation of the deduplication ratio under different similarity threshold conditions according to the second embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for extracting document keywords according to the third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device that can be used to implement the fourth embodiment of the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solutions of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Embodiment One
FIG. 1a is a flowchart of a method for extracting document keywords according to the first embodiment of the present invention. The present embodiment is applicable to extracting keywords from an unstructured long document. The method may be performed by a document keyword extraction device, which may be implemented in hardware and/or software and may be configured in a terminal or a server that carries a document keyword extraction function. As shown in FIG. 1a, the method comprises:
S110, converting the unstructured long document to be processed into a structured long document.
Wherein the unstructured long document is a long document containing unstructured information. For example, the unstructured long document may be a file in a given format that stores a large number of characters. Unstructured information is information whose form is relatively unfixed; it usually exists as files in various formats, covers a wide range of content, and cannot be fully digitized. It can be divided into: operational content, such as contracts, letters, and purchase records; departmental content, such as paperwork, spreadsheets, and email; and web page content, such as information in HTML and XML (eXtensible Markup Language) format.
In the present embodiment, the division of the long document and the short document can be performed using the scale of the number of characters in the document. Specifically, if the number of characters included in a document exceeds a preset word number threshold, determining that the document is a long document; otherwise, the document is determined to be a short document.
The structured long document is a long document containing structured information; further, the structured information means that the information can be decomposed into a plurality of components which are related to each other after being analyzed, and each component has a clear hierarchical structure, and the use and maintenance of the components are managed through a database and have certain operation specifications. Further, records including production, business and transaction aspects belong to the structured information.
Further, converting the unstructured long document to be processed into a structured long document may include:
converting the unstructured long document into a semi-structured long document in an XML format by adopting an XML document generating tool; and converting the semi-structured long document into a structured long document by adopting an XML document analysis tool.
The semi-structured long document in XML format has a hierarchical format and can be used by different types of Web (World Wide Web) programs, so an XML file is used to carry the incompletely structured data of the unstructured long document. Furthermore, an XML document uses a nested element structure that starts from the root element, and every element has a tag associated with it. Each element except the nested element has an associated attribute and value.
The XML document parsing tool may be a SAX parser, which offers high parsing efficiency and low memory usage and is suitable for mobile devices. A SAX parser is essentially event-driven: with an event mechanism at its core, combined with method callbacks, it efficiently parses the incompletely structured data carried by the XML file.
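The two-stage conversion described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: the element names `document` and `paragraph`, the blank-line paragraph split, and the helper names `to_semi_structured_xml`, `ParagraphHandler`, and `to_structured_records` are hypothetical and do not come from the patent text.

```python
import xml.sax
from xml.etree import ElementTree as ET


def to_semi_structured_xml(raw_text: str) -> bytes:
    """Wrap blank-line-separated paragraphs in a nested XML element structure."""
    root = ET.Element("document")
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        node = ET.SubElement(root, "paragraph", {"id": str(index)})
        node.text = paragraph
    return ET.tostring(root, encoding="utf-8")


class ParagraphHandler(xml.sax.ContentHandler):
    """Event-driven handler: the SAX parser calls these methods back as it reads."""

    def __init__(self) -> None:
        super().__init__()
        self.records = []
        self._current_id = None
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "paragraph":
            self._current_id = attrs.get("id")
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join it later.
        if self._current_id is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "paragraph":
            self.records.append(
                {"id": int(self._current_id), "text": "".join(self._buffer)}
            )
            self._current_id = None


def to_structured_records(xml_bytes: bytes) -> list:
    """Parse the semi-structured XML into a list of structured records."""
    handler = ParagraphHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.records
```

Because SAX streams events instead of building a full tree, memory use stays roughly proportional to one paragraph rather than to the whole document, which matches the low-memory property mentioned above.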
S120, according to the Hash fingerprint of each text in the structured long document, text deduplication processing is carried out on the structured long document to obtain a target structured document, wherein each text comprises at least one sentence.
Because the structured long document contains a large number of characters, and therefore a large number of sentences, extracting document keywords directly from a document of this size would require a very large amount of computation. In this embodiment, the repeated texts in the structured long document are therefore deduplicated first, to reduce the amount of computation as much as possible.
The text is a part of a document, and may be, for example, a chapter, a paragraph, or a sentence. The person skilled in the art can select a suitable way to divide a plurality of texts in a structured long document according to actual needs.
A hash fingerprint can be understood as an identifier of a text and is determined from the entire content of the text. Two texts with similar content also have similar hash fingerprints, so by comparing the similarity between the hash fingerprints of different texts, texts with highly similar content can be found in the document and deleted, which reduces the subsequent amount of computation.
The performing text deduplication processing on the long structured document according to the hash fingerprint of each text in the long structured document to obtain a target structured document may include:
dividing the structured long document into a plurality of texts, and performing word segmentation on each sentence in each text to obtain a word segmentation set corresponding to each text;
calculating the hash fingerprint of each text according to the hash code value of each participle in the word segmentation set corresponding to that text;
calculating the Hamming distance between every two texts according to the hash fingerprint of each text; and performing text deduplication processing on the structured long document according to the Hamming distance between every two texts to obtain the target structured document.
Calculating the hash fingerprint of each text according to the hash code value of each participle in the word segmentation set corresponding to that text may include:
acquiring a target hash code value of each participle in the word segmentation set corresponding to the currently processed target text; accumulating the hash values of the same code bit across the target hash code values to obtain an accumulated hash code value corresponding to the target text; and binarizing the hash value of each code bit in the accumulated hash code value according to a preset threshold to obtain the hash fingerprint of the target text.
In a specific example, the word segmentation set corresponding to target text A comprises participle 1 and participle 2, the target hash code value of participle 1 is {1, -1,1}, and the target hash code value of participle 2 is { -1, -1}. The accumulated hash code value corresponding to target text A, {0,2, 1,0}, is obtained by accumulating the hash values of the same code bit across the target hash code values. Assuming that the accumulated hash code value corresponding to target text B is {155, -132, 100, -121, 53, 0} and the preset threshold is 0, each code bit whose hash value is greater than 0 is binarized to 1 and each code bit whose hash value is less than or equal to 0 is binarized to 0, so the hash fingerprint of target text B is {1, 0, 1, 0, 1, 0}.
Wherein the Hamming distance is the number of positions at which two binary strings of equal length differ; for example, the Hamming distance between {1,1,1,0,0,0} and {1,1,1,1,1,1} is 3.
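The accumulate-then-binarize fingerprint and the Hamming distance described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text describes dictionary coding of each participle, whereas this sketch substitutes an MD5 digest as an assumed stand-in, and the names `token_code`, `binarize`, `fingerprint`, and `hamming_distance` are hypothetical.

```python
import hashlib


def token_code(token: str, bits: int = 64) -> list:
    """Map a token to a vector of +1/-1 values (0 bits are replaced by -1).

    Assumption: an MD5 digest stands in for the dictionary coding in the text.
    """
    digest = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
    return [1 if (digest >> i) & 1 else -1 for i in range(bits)]


def binarize(sums: list, threshold: int = 0) -> list:
    """Set code bits above the threshold to 1 and all other bits to 0."""
    return [1 if value > threshold else 0 for value in sums]


def fingerprint(tokens: list, bits: int = 64, threshold: int = 0) -> list:
    """Accumulate the per-bit values of all tokens, then binarize the sums."""
    sums = [0] * bits
    for token in tokens:
        for i, value in enumerate(token_code(token, bits)):
            sums[i] += value
    return binarize(sums, threshold)


def hamming_distance(a: list, b: list) -> int:
    """Count the positions at which two equal-length bit sequences differ."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Applied to the values quoted above, `binarize([155, -132, 100, -121, 53, 0])` gives `[1, 0, 1, 0, 1, 0]`, and `hamming_distance([1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1])` gives 3.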
Further, performing text deduplication processing on the structured long document may include:
calculating the Hamming distance between different texts, and judging the similarity of every two texts in the structured document according to the Hamming distance between the two texts.
Wherein, the similarity is inversely related to the Hamming distance: the smaller the Hamming distance, the higher the similarity of the two texts in the structured document. Further, when the similarity of two texts is higher than the set threshold, one of the two texts may be deleted.
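A minimal sketch of the pairwise deduplication step, assuming each text has already been reduced to a fingerprint of equal length. The function name `deduplicate` and the default `max_distance=3` are illustrative assumptions; the text does not fix a threshold in this section.

```python
def deduplicate(fingerprints: list, max_distance: int = 3) -> list:
    """Return the indices of the texts to keep.

    A text is dropped when its fingerprint lies within max_distance (Hamming)
    of a text that has already been kept, so the first of each group survives.
    """
    kept = []
    for index, candidate in enumerate(fingerprints):
        is_duplicate = any(
            sum(a != b for a, b in zip(candidate, fingerprints[k])) <= max_distance
            for k in kept
        )
        if not is_duplicate:
            kept.append(index)
    return kept
```

This compares each text against every kept text, so the cost grows quadratically with the number of distinct texts; for very large documents a bucketed lookup over fingerprint segments is a common alternative, though the text here does not describe one.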
For example, as shown in FIG. 1b, after the structured document is obtained, it needs to be divided into a plurality of texts, and a corresponding hash fingerprint is calculated for each text. FIG. 1b takes one target text as an example and shows a specific way of calculating its hash fingerprint. First, the target text may be divided into one or more sentences, and word segmentation is then performed on each sentence to obtain the participles corresponding to the target text, namely participle 1, participle 2, ..., participle n. Next, the target hash code value of each participle in the word segmentation set corresponding to the currently processed target text is calculated. Taking participle 1 as an example, a pre-constructed dictionary may be used to dictionary-code participle 1, and the coded value is then converted into a hash code value; in a specific implementation, each 0 in the coded value may be replaced by -1 to obtain the hash code value of participle 1. The hash values of the corresponding code bits in all the hash code values are then accumulated to obtain a sum vector. Finally, each element of the sum vector is compared with a set threshold to binarize the sum vector, for example, elements greater than 0 are set to 1 and elements less than or equal to 0 are set to 0, which yields the corresponding hash fingerprint.
S130, identifying each candidate keyword in the target structured document, and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document.
Wherein, the word frequency-inverse text frequency index in the structured long document can be used for measuring the importance degree of a word to a document set. If a word appears in a document with a high frequency and appears in other documents with a low probability, the word has a strong ability to distinguish, and the word frequency-inverse text frequency index of the word is also higher.
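As a rough illustration of the index just described, the sketch below scores each word of each text. It makes two assumptions that are not taken from this section: every text of the long document is treated as one "document" of the set, and the unsmoothed form tf × log(N / df) is used. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(texts: list) -> list:
    """texts is a list of token lists; returns one {token: score} dict per text."""
    n_texts = len(texts)
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each token once per text
    scores = []
    for tokens in texts:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {
                token: (count / total) * math.log(n_texts / doc_freq[token])
                for token, count in counts.items()
            }
        )
    return scores
```

A word that occurs in every text gets log(N / N) = 0, so it contributes nothing, while a word concentrated in one text scores highest there.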
The information entropy of a candidate keyword in the structured long document can be used to evaluate how disordered the distribution of subject terms is; a word with larger information entropy has a more uniform subject distribution, that is, higher disorder, while a word with smaller information entropy has a more concentrated subject distribution and lower disorder.
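One way to make this concrete is the Shannon entropy of a word's occurrence distribution across the texts of the document. This is an assumption for illustration only: this section does not state the entropy formula, and the function name `word_entropy` is hypothetical.

```python
import math


def word_entropy(word: str, texts: list) -> float:
    """Shannon entropy (in bits) of how a word's occurrences spread over texts."""
    counts = [tokens.count(word) for tokens in texts]
    total = sum(counts)
    if total == 0:
        return 0.0  # the word never occurs, so there is no distribution
    return -sum(
        (count / total) * math.log2(count / total) for count in counts if count > 0
    )
```

A word spread evenly over two texts has an entropy of 1 bit, while a word confined to a single text has an entropy of 0, matching the uniform-versus-concentrated reading above.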
The weight value of each candidate keyword can be used to evaluate the importance of the candidate keyword relative to the structured document. A weight value differs from a plain proportion: it reflects both the importance and the contribution of the keyword relative to the structured document.
S140, calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
In this embodiment, the TextRank algorithm may be used as the preset text sorting algorithm. In the prior art, the TextRank algorithm uses the same weight value for every candidate keyword, which is equivalent to treating all candidate keywords as equally important. In fact, different candidate keywords in a document differ inherently in importance. This embodiment takes these differences into account by calculating the weight value of each candidate keyword in advance. The score of each candidate keyword is then calculated from its individual weight value, so the document keywords can be screened from the candidate keywords more accurately.
Correspondingly, calculating a score value corresponding to each candidate keyword according to the weight value of each candidate keyword and a preset text sorting algorithm may include:
generating a candidate keyword graph according to the co-occurrence of different candidate keywords in a preset length window;
according to the candidate keyword graph and the weight value of each candidate keyword, the following formula is adopted:
Figure BDA0003876521030000091
iteratively obtaining score values corresponding to the candidate keywords;
wherein WS(V_i) is the score value of candidate keyword V_i; d is a preset damping factor; In(V_i) is the set of candidate keywords that are in-edge nodes of candidate keyword V_i in the candidate keyword graph; Out(V_i) is the set of candidate keywords that are out-edge nodes of candidate keyword V_i in the candidate keyword graph; and W_ji is the transition probability between candidate keyword i and candidate keyword j;
wherein,
Figure BDA0003876521030000092
W(V_i) is the weight value of candidate keyword V_i.
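The two formulas referenced above appear only as image placeholders in this text. One form that is consistent with the symbol definitions given here, and with the standard weighted TextRank update, is sketched below. It is an assumption, not a transcription of the original formulas.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
          \frac{W_{ji}}{\sum_{V_k \in Out(V_j)} W_{jk}} \, WS(V_j),
\qquad
W_{ji} = \frac{W(V_i)}{\sum_{V_k \in Out(V_j)} W(V_k)}
```

Under this reading, a node V_j passes its score to its out-edge nodes in proportion to their precomputed weight values, so a neighbor with a higher weight value receives a larger share.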
Further, the method provided in this embodiment obtains the score value of each candidate keyword through multiple iterations. After each iteration, the difference between the score values of each candidate keyword in the current iteration and the previous iteration is calculated; if the difference is smaller than a preset threshold, for example 0.001, the iteration terminates and the next step is performed. If the number of iterations reaches a termination threshold, for example 300, the iteration also terminates and the next step is performed; otherwise, the next iteration is carried out. Repeating these steps makes the score value of each candidate keyword stabilize, because the algorithm eventually converges, and the converged value is the score value.
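The graph construction and iteration described above can be sketched as follows. The sketch assumes an undirected co-occurrence graph (so the in-edge and out-edge neighbors of a node coincide) and a transition share proportional to the precomputed weight of the receiving keyword; both are assumptions, as are the names `build_graph`, `weighted_textrank`, and `top_n` and the default damping factor of 0.85. The stopping values 0.001 and 300 are the examples given in the text.

```python
from collections import defaultdict


def build_graph(tokens: list, window: int = 3) -> dict:
    """Link candidate keywords that co-occur inside a sliding window."""
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    return neighbors


def weighted_textrank(neighbors, weights, d=0.85, tol=0.001, max_iter=300):
    """Iterate scores until the largest change is below tol or max_iter is hit.

    weights must hold a precomputed weight value for every node in the graph.
    """
    scores = {word: 1.0 for word in neighbors}
    if not scores:
        return scores
    for _ in range(max_iter):
        new_scores = {}
        for vi in neighbors:
            rank = 0.0
            for vj in neighbors[vi]:
                # Total weight of the nodes that vj can pass its score to.
                out_weight = sum(weights[vk] for vk in neighbors[vj])
                if out_weight > 0:
                    rank += weights[vi] / out_weight * scores[vj]
            new_scores[vi] = (1 - d) + d * rank
        delta = max(abs(new_scores[w] - scores[w]) for w in scores)
        scores = new_scores
        if delta < tol:
            break
    return scores


def top_n(scores: dict, n: int) -> list:
    """Return the n candidate keywords with the highest score values."""
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With uniform weights this reduces to the unweighted behavior criticized above; supplying larger weights for some keywords shifts score toward them at every iteration.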
The screening of the document keywords in the candidate keywords according to the score values may include:
sorting the candidate keywords by their final score values in the document, and taking the top N candidate keywords with the highest score values as the keywords of the unstructured long document.
The technical scheme of the embodiment of the invention provides a document keyword extraction method: the unstructured long document is converted into a structured long document; the texts are deduplicated according to the hash fingerprint of each text; each candidate keyword in the deduplicated document is identified; the weight of each candidate keyword is determined from its word frequency-inverse text frequency index and its information entropy; and the document keywords are obtained by screening according to the weights and the calculation rule of a preset text sorting algorithm, so that the accuracy of extracting keywords from the unstructured long document is improved.
Embodiment Two
FIG. 2a is a flowchart of a method for extracting document keywords according to the second embodiment of the present invention; the present embodiment refines the steps of the above embodiment. In this embodiment, identifying each candidate keyword in the target structured document and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document is embodied as: acquiring the word segmentation set corresponding to each text in the target structured document, and filtering stop words out of each word segmentation set; performing part-of-speech tagging on the participles in each filtered word segmentation set; according to the part-of-speech tagging result, retaining the participles of at least one specified part of speech in each word segmentation set as candidate keywords; calculating a first weight of each candidate keyword according to its word frequency-inverse text frequency index in the structured long document; calculating a second weight of each candidate keyword according to its information entropy in the structured long document; and calculating the weight value of each candidate keyword according to the first weight coefficient, the second weight coefficient, the first weight, and the second weight of the candidate keyword.
Accordingly, as shown in FIG. 2a, the method comprises:
S210, converting the unstructured long document to be processed into a structured long document.
S220, according to the Hash fingerprint of each text in the structured long document, text deduplication processing is carried out on the structured long document to obtain a target structured document, wherein each text comprises at least one sentence.
S230, acquiring the word segmentation set corresponding to each text in the target structured document, and filtering stop words out of each word segmentation set.
The stop words are function words and non-retrieval words in computer retrieval. Stop words include Chinese and English words that are widely and frequently used, as well as words that occur frequently in texts but carry no practical meaning, including modal particles, adverbs, prepositions, conjunctions, articles, and onomatopoeia.
S240, performing part-of-speech tagging on the segmented words in each filtered word segmentation set.
It is easy to understand that the filtered word segmentation sets contain no stop words. The parts of speech of the segmented words may include content words such as nouns, pronouns, adjectives, verbs, numerals, and quantifiers.
Part-of-speech tagging assigns a part of speech to each segmented word in each filtered word segmentation set. For Chinese texts, a word segmentation tool can be used to segment the text and tag parts of speech; for English texts, the document is split into words on spaces, stemming is performed to obtain the word stems, and part-of-speech tagging is then performed.
S250, according to the part-of-speech tagging result, retaining the segmented words of at least one specified part of speech in each word segmentation set as candidate keywords.
In this embodiment, document keywords are considered to be mainly nouns, verbs, or adjectives. Therefore, after the part-of-speech tagging result is obtained, only the segmented words tagged as nouns, verbs, or adjectives may be retained as candidate keywords.
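As a non-limiting illustration, steps S230 to S250 can be sketched in Python as follows. The sketch assumes that each text has already been segmented and tagged by some external tool, so the input is a list of (word, tag) pairs; the stop-word list, the tag names `noun`, `verb`, `adj`, and the function name `select_candidates` are hypothetical and are not taken from the embodiment.

```python
# Hypothetical stop-word list and tag names, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}
KEPT_POS = {"noun", "verb", "adj"}


def select_candidates(tagged_words, stop_words=STOP_WORDS, kept_pos=KEPT_POS):
    """Filter stop words, then keep only words of the specified parts of speech."""
    # S230: drop stop words from the word segmentation set.
    filtered = [(word, pos) for word, pos in tagged_words
                if word.lower() not in stop_words]
    # S240/S250: use the part-of-speech tags to retain candidate keywords.
    return [word for word, pos in filtered if pos in kept_pos]


tagged = [("the", "det"), ("robot", "noun"), ("quickly", "adv"),
          ("extracts", "verb"), ("important", "adj"), ("keywords", "noun")]
print(select_candidates(tagged))  # ['robot', 'extracts', 'important', 'keywords']
```

In a real pipeline the tags would come from the word segmentation tool mentioned above, and the set of retained parts of speech could be configured per application.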
S260, calculating a first weight of each candidate keyword according to the word frequency-inverse text frequency index of each candidate keyword in the structured long document.
In this embodiment, the word frequency-inverse text frequency index is calculated using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. The index measures the importance of a word to a document within a document set. The main idea is that if a word appears frequently in one document (i.e., has a high TF) and rarely appears in other documents (i.e., has a high IDF), the word has strong discriminating ability. The calculation formulas of the word frequency-inverse text frequency index are as follows:
Figure BDA0003876521030000121
Figure BDA0003876521030000122
where t_i is the i-th candidate keyword; n_i is the number of times the candidate keyword t_i appears in the document; TF_{t_i} is the word frequency of the candidate keyword t_i; Σ_k n_k is the total number of occurrences of all candidate keywords in the document; IDF_i is the inverse document frequency of the candidate keyword t_i; D is the total number of documents in the system; and D_w is the number of documents containing the candidate keyword t_i.
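For illustration only, the quantities defined above can be computed as in the following Python sketch. Because the equation images are not reproduced here, the sketch assumes the common forms TF = n_i / Σ_k n_k and IDF = log(D / D_w) with a natural logarithm and no smoothing; the function name `tf_idf` and the input layout (one list of candidate keywords per document) are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of candidate-keyword lists, one list per document.

    Returns one {keyword: TF * IDF} dictionary per document, assuming
    TF = n_i / sum_k(n_k) and IDF = log(D / D_w).
    """
    n_docs = len(docs)                      # D: number of documents
    doc_freq = Counter()                    # D_w: documents containing each keyword
    for doc in docs:
        doc_freq.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)               # n_i for each keyword in this document
        total = sum(counts.values())        # sum_k n_k
        scores.append({
            t: (n / total) * math.log(n_docs / doc_freq[t])
            for t, n in counts.items()
        })
    return scores


docs = [["robot", "keyword", "robot"], ["keyword", "entropy"]]
print(tf_idf(docs)[0])
```

A keyword that occurs in every document receives an IDF of zero under this assumed form, which is why smoothed variants are often used in practice.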
S270, calculating a second weight of each candidate keyword according to the information entropy of each candidate keyword in the structured long document.
The information entropy is used to evaluate the degree of disorder in the distribution of topic words. A word with larger information entropy has a more uniform topic distribution, i.e., higher disorder, whereas a word with smaller information entropy has a more concentrated topic distribution and lower disorder.
In this embodiment, the information entropy may be calculated for each text to which a candidate keyword belongs and then used as the information entropy of that candidate keyword. The calculation formula of the information entropy of the i-th text w_i is as follows:
Figure BDA0003876521030000123
where p(k = t_{w_i}) is the frequency of occurrence of the candidate keyword t in the text w_i of the structured long document, and K is the total number of candidate keywords included in the text w_i.
After the information entropy of one text is obtained, the information entropy can be used as the information entropy of each candidate keyword in the text.
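The per-text entropy and its assignment to candidate keywords can be sketched as follows. This is a minimal sketch under stated assumptions: the entropy is taken to be the Shannon entropy of the candidate-keyword frequencies within one text, the logarithm base is assumed to be 2, and a keyword occurring in several texts is assumed to receive the mean of those texts' entropies. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def text_entropy(candidates):
    """Shannon entropy (base 2, assumed) of keyword frequencies within one text."""
    counts = Counter(candidates)
    total = sum(counts.values())
    # p * log2(1 / p) summed over the distinct candidate keywords of the text.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def keyword_entropy(texts):
    """Assign each keyword the mean entropy of the texts that contain it."""
    per_keyword = defaultdict(list)
    for candidates in texts:
        h = text_entropy(candidates)
        for t in set(candidates):
            per_keyword[t].append(h)
    return {t: sum(hs) / len(hs) for t, hs in per_keyword.items()}


texts = [["robot", "keyword"], ["robot", "robot"]]
print(keyword_entropy(texts))
```

For the two example texts, the first has entropy 1.0 (two equally frequent keywords) and the second has entropy 0.0 (a single keyword), so `robot` receives the mean 0.5 and `keyword` receives 1.0.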
S280, calculating the weight value of each candidate keyword according to the first weight coefficient, the second weight coefficient, the first weight value and the second weight value of each candidate keyword.
According to the first weight coefficient, the second weight coefficient, the first weight and the second weight of each candidate keyword, the initial weight value of a given candidate keyword w_i in the structured long document can be calculated.
In a specific example, if the first weight coefficient and the second weight coefficient are both 1/2, the weight value of the i-th candidate keyword w_i may be expressed as:
Figure BDA0003876521030000131
where W_T(i) represents the first weight, determined by the word frequency-inverse text frequency index, and W_E(i) represents the second weight, determined by the mean information entropy.
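The equation image for this example is not reproduced in the text. One reading consistent with the definitions above, assuming that the two weights are combined linearly, is given below; the symbols \alpha and \beta are introduced here for the first and second weight coefficients and do not appear in the original.

```latex
W(i) = \alpha\, W_T(i) + \beta\, W_E(i),
\qquad
\alpha = \beta = \tfrac{1}{2}
\;\Longrightarrow\;
W(i) = \tfrac{1}{2}\, W_T(i) + \tfrac{1}{2}\, W_E(i)
```

Under this assumed form, a candidate keyword with W_T(i) = 0.4 and W_E(i) = 0.8 would receive W(i) = 0.6.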
S290, calculating the score value corresponding to each candidate keyword according to the weight value of each candidate keyword and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
The technical scheme of this embodiment of the invention provides a document keyword extraction method. An unstructured long document is converted into a structured long document; the texts are deduplicated according to their hash fingerprints; the candidate keywords of the deduplicated document are identified; the weight value of each candidate keyword is determined by calculating its word frequency-inverse text frequency index and information entropy; and the document keywords are obtained by screening according to the weight values and the calculation rule of a preset text sorting algorithm. The accuracy of keyword extraction from unstructured long documents is thereby improved.
Specific application scenario
To express the technical scheme provided by the embodiment of the invention more clearly, the scheme is verified experimentally and its performance is compared with that of other document keyword extraction methods.
Illustratively, the method of this embodiment is used to convert unstructured long documents of different sizes in the test object, with XML documents carrying the semi-structured data so that the unstructured long documents are converted indirectly. Unstructured long documents with sizes of 1.6 MB-1018M 9B are selected from the test object, and the efficiency of converting unstructured documents of different sizes into structured documents is shown in Fig. 2b.
As shown in Fig. 2b, when the method of this embodiment is used to convert the unstructured long documents in the test object, the conversion time is proportional to the size of the unstructured long document, i.e., the larger the document, the longer the conversion takes. The parsing efficiency of the method fluctuates within the range of 0 MB/ms-700 MB/ms, which indicates that the method converts the unstructured long documents in the test object into structured documents with high efficiency.
Further, a document is randomly selected from the converted structured documents; after conversion, the document includes 1174 pieces of text information. The method of this embodiment is used to deduplicate all text information in the document, and the fluctuation of the deduplication rate is analyzed under different Hamming distances, with the fingerprint length and the deduplication threshold set to 128 and 70, respectively. The result is shown in Fig. 2c.
Analysis of Fig. 2c shows that, as the Hamming distance increases, the deduplication rate of the method of this embodiment gradually increases, i.e., the Hamming distance is positively correlated with the deduplication rate. When the Hamming distance increases from 1 to 10, the deduplication rate for the texts in the structured document of the test object rises from 32% to 95%; when the Hamming distance increases from 10 to 15, the deduplication rate rises from 95% to 98%. The method of this embodiment therefore has a high deduplication rate and a good deduplication effect.
Further, to analyze the influence of the similarity threshold on the deduplication rate of the method of this embodiment, the Hamming distance is set to 10 and the similarity threshold is set to 40, 50, 60, 70, 80, and 90 in turn. The fluctuation of the deduplication rate under the different similarity thresholds is compared, and the result is shown in Fig. 2d.
Analysis of Fig. 2d shows that, as the similarity threshold increases, the deduplication rate of this embodiment also gradually increases. When the similarity threshold increases from 40 to 70, the deduplication rate of the method rises from 35% to 89%; when the similarity threshold increases from 70 to 90, the deduplication rate rises from 89% to 94%. The method therefore has a high deduplication rate and a good deduplication effect.
In the prior art, keywords of an unstructured long document are generally extracted using a method based on information fusion or a deep learning method. For example, different numbers of feature words are extracted from the same text using the information fusion-based method, the deep learning method, and the method of this embodiment, and the resulting precision, recall, and F1 values are shown in Table 1:
TABLE 1
Figure BDA0003876521030000151
Further, analysis of Table 1 shows that the precision, recall, and F1 value of the method of this embodiment all increase gradually as the number of extracted keywords increases; the precision of the information fusion-based method decreases as the number of keywords increases; and the precision and F1 value of the deep learning method decrease as the number of keywords increases. It is easy to understand that the keyword extraction performance of this embodiment is significantly better than that of the two comparison methods.
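For reference, the three metrics discussed above can be computed as in the following sketch. It assumes that a set of reference (manually labeled) keywords is available for each document, which the embodiment does not describe in detail; the function name and the example data are hypothetical.

```python
def precision_recall_f1(extracted, reference):
    """Compare extracted keywords with reference keywords (both treated as sets)."""
    extracted, reference = set(extracted), set(reference)
    hits = len(extracted & reference)            # correctly extracted keywords
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(reference) if reference else 0.0
    # F1 is the harmonic mean of precision and recall; defined as 0 when there are no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"])
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Extracting more keywords typically raises recall while putting pressure on precision, which is why the F1 value is reported alongside both.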
EXAMPLE III
Fig. 3 is a schematic structural diagram of an apparatus for extracting a document keyword according to a third embodiment of the present invention. As shown in fig. 3, the apparatus includes:
the document structure conversion module 310 is configured to convert an unstructured long document to be processed into a structured long document;
the document deduplication module 320 is configured to perform text deduplication processing on the long structured document according to the hash fingerprint of each text in the long structured document to obtain a target structured document, where each text includes at least one sentence;
the weighted value calculating module 330 is configured to identify each candidate keyword in the target structured document, and calculate a weighted value of each candidate keyword according to a word frequency-inverse text frequency index and an information entropy of each candidate keyword in the structured long document;
and the document keyword screening module 340 is configured to calculate a score value corresponding to each candidate keyword according to the weight value of each candidate keyword and a preset text sorting algorithm, and screen each candidate keyword according to the score value to obtain a document keyword.
The technical scheme of this embodiment of the invention converts an unstructured long document into a structured long document, deduplicates the texts according to their hash fingerprints, identifies the candidate keywords of the deduplicated document, determines the weight value of each candidate keyword by calculating its word frequency-inverse text frequency index and information entropy, and obtains the document keywords by screening according to the weight values and the calculation rule of a preset text sorting algorithm, thereby improving the accuracy of keyword extraction from unstructured long documents.
On the basis of the foregoing embodiments, the document structure conversion module 310 may be specifically configured to: convert the unstructured long document into a semi-structured long document in XML format using an XML document generation tool; and convert the semi-structured long document into a structured long document using an XML document parsing tool.
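The two-stage conversion can be illustrated with the Python standard library. This is only a sketch: the embodiment does not name a specific XML generation or parsing tool, so `xml.etree.ElementTree` stands in for both, and the element names `document` and `text`, the paragraph-splitting rule, and the record layout are assumptions.

```python
import xml.etree.ElementTree as ET


def to_semi_structured_xml(raw_text):
    """Stage 1: wrap each non-empty paragraph in a <text> element (assumed schema)."""
    root = ET.Element("document")
    for idx, para in enumerate(p.strip() for p in raw_text.split("\n\n")):
        if para:
            ET.SubElement(root, "text", id=str(idx)).text = para
    return ET.tostring(root, encoding="unicode")


def to_structured_records(xml_string):
    """Stage 2: parse the XML into a list of {"id", "content"} records."""
    root = ET.fromstring(xml_string)
    return [{"id": int(el.get("id")), "content": el.text}
            for el in root.iter("text")]


xml_doc = to_semi_structured_xml("First paragraph.\n\nSecond paragraph.")
print(to_structured_records(xml_doc))
```

Each record corresponds to one text of the structured long document and can be passed on to the deduplication step.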
On the basis of the foregoing embodiments, the document deduplication module 320 may specifically include:
the calculation unit is used for calculating the hash fingerprint of each text according to the hash code value of each segmented word in the word segmentation set corresponding to each text, and for calculating the Hamming distance between every two texts according to the hash fingerprint of each text;
the accumulation unit is used for accumulating the hash values of the same coding bits in the target hash coding values to obtain an accumulated hash coding value corresponding to the target text;
and the binarization unit is used for carrying out binarization processing on the hash values of all the coding bits in the accumulated hash coding values according to a preset threshold value to obtain the hash fingerprints of the target text.
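The fingerprinting and deduplication performed by these units resemble the well-known SimHash scheme, which can be sketched as follows. The sketch rests on several assumptions that are not stated in the embodiment: MD5 is used as the per-word hash, each bit contributes +1 or -1 to the accumulated value, the binarization threshold is 0, and two texts are treated as duplicates when their Hamming distance does not exceed a hypothetical `max_distance`. The 128-bit length matches the fingerprint length used in the experiment above.

```python
import hashlib


def word_hash(word, bits=128):
    """Hash code value of one segmented word (MD5 is an assumed choice)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(words, bits=128, threshold=0):
    """Accumulate per-bit votes over all words, then binarize by the threshold."""
    acc = [0] * bits
    for word in words:
        h = word_hash(word, bits)
        for b in range(bits):
            acc[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if acc[b] > threshold:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3, bits=128):
    """Keep a text only if it is farther than max_distance from every kept text."""
    kept, fingerprints = [], []
    for words in texts:
        fp = simhash(words, bits)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(words)
            fingerprints.append(fp)
    return kept


print(len(deduplicate([["long", "document"], ["long", "document"]])))  # 1
```

The pairwise comparison shown here is quadratic in the number of texts; larger collections would usually index fingerprints by bit blocks instead.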
On the basis of the foregoing embodiments, the weight value calculating module 330 may specifically include:
the filtering unit is used for acquiring the word segmentation set corresponding to each text in the target structured document and filtering stop words out of each word segmentation set; performing part-of-speech tagging on the segmented words in each filtered word segmentation set; and retaining, according to the part-of-speech tagging result, the segmented words of at least one specified part of speech in each word segmentation set as candidate keywords.
The calculation unit is used for calculating a first weight of each candidate keyword according to the word frequency-inverse text frequency index of each candidate keyword in the structured long document; calculating a second weight of each candidate keyword according to the information entropy of each candidate keyword in the structured long document; and calculating the weight value of each candidate keyword according to the first weight coefficient, the second weight coefficient, the first weight and the second weight of each candidate keyword.
On the basis of the foregoing embodiments, the document keyword screening module 340 may be specifically configured to:
generating a candidate keyword graph according to the co-occurrence of different candidate keywords in a preset length window; according to the candidate keyword graph and the weight value of each candidate keyword, the following formula is adopted:
Figure BDA0003876521030000171
iteratively obtaining score values corresponding to the candidate keywords;
wherein WS (V) i ) As candidate key words V i D is a predetermined damping factor, in (V) i ) For candidate keyword V in candidate keyword graph i For each candidate keyword of the edge node, out (V) j ) For candidate keyword V in candidate keyword graph i The candidate keywords are the edge nodes; w ji Is the transition probability between the candidate keyword i and the candidate keyword j;
wherein,
Figure BDA0003876521030000181
W(V_i) is the weight value of the candidate keyword V_i.
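The iterative scoring can be sketched in the style of a weighted TextRank. Because the two equation images are not reproduced, the sketch assumes the standard form WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} w_ji * WS(V_j), and further assumes that the transition probability is w_ji = W(V_i) / Σ_{V_k ∈ Out(V_j)} W(V_k). The window size, the damping factor 0.85, the fixed iteration count, and all function names are hypothetical.

```python
from collections import defaultdict


def build_graph(candidates, window=3):
    """Undirected co-occurrence graph over candidates within a sliding window."""
    neighbours = defaultdict(set)
    for i, u in enumerate(candidates):
        for v in candidates[i + 1:i + window]:
            if u != v:
                neighbours[u].add(v)
                neighbours[v].add(u)
    return neighbours


def weighted_textrank(neighbours, weights, d=0.85, iterations=50):
    """Iterate the assumed score formula; weights maps each node to W(V)."""
    scores = {v: 1.0 for v in neighbours}
    for _ in range(iterations):
        new_scores = {}
        for vi in neighbours:
            total = 0.0
            for vj in neighbours[vi]:
                out_weight = sum(weights[vk] for vk in neighbours[vj])
                if out_weight > 0:
                    # Assumed transition probability from V_j to V_i.
                    total += weights[vi] / out_weight * scores[vj]
            new_scores[vi] = (1 - d) + d * total
        scores = new_scores
    return scores


def top_keywords(scores, k):
    """Screen the k candidates with the highest score values."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


graph = build_graph(["robot", "keyword", "entropy"], window=2)
scores = weighted_textrank(graph, {"robot": 3.0, "keyword": 1.0, "entropy": 1.0})
print(top_keywords(scores, 2))
```

Under this assumed transition probability, a neighbour distributes its score in proportion to the weight values of the nodes it links to, so candidates with a higher combined TF-IDF and entropy weight attract more score.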
The device for extracting the document keywords provided by the embodiment of the invention can execute the method for extracting the document keywords provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE IV
FIG. 4 shows a schematic block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to the bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The processor 11 performs the various methods and processes described above, such as the extraction method of the document keywords.
Namely: converting an unstructured long document to be processed into a structured long document;
according to the Hash fingerprint of each text in the structured long document, carrying out text deduplication processing on the structured long document to obtain a target structured document, wherein each text comprises at least one sentence;
identifying each candidate keyword in the target structured document, and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document;
and calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
In some embodiments, the method of extracting the document keywords may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the above-described document keyword extraction method may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the method of extracting the document keywords by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Computer programs for implementing the methods of the present invention can be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for extracting a document keyword is characterized by comprising the following steps:
converting the unstructured long document to be processed into a structured long document;
according to the Hash fingerprint of each text in the structured long document, carrying out text deduplication processing on the structured long document to obtain a target structured document, wherein each text comprises at least one sentence;
identifying each candidate keyword in the target structured document, and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document;
and calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
2. The method of claim 1, wherein converting the unstructured long document to be processed into a structured long document comprises:
converting the unstructured long document into a semi-structured long document in an XML format by adopting an XML document generating tool;
and converting the semi-structured long document into a structured long document by adopting an XML (Extensible Markup Language) document parsing tool.
3. The method of claim 1, wherein performing text deduplication processing on the long structured document according to the hash fingerprint of each text in the long structured document to obtain a target structured document comprises:
dividing the structured document into a plurality of texts, and performing word segmentation processing on each sentence in each text to obtain a word segmentation set corresponding to each text;
calculating to obtain a hash fingerprint of each text according to the hash code value of each word in the word set corresponding to each text;
calculating the hamming distance between every two texts according to the hash fingerprint of each text;
and according to the Hamming distance between every two texts, carrying out text duplication elimination processing on the structured long document to obtain the target structured document.
4. The method of claim 3, wherein calculating the hash fingerprint of each text according to the hash code value of each segmented word in the word segmentation set corresponding to each text comprises:
acquiring a target hash code value of each segmented word in the word segmentation set corresponding to a currently processed target text;
accumulating the hash values of the same coding bits in each target hash coding value to obtain an accumulated hash coding value corresponding to a target text;
and according to a preset threshold value, performing binarization processing on the hash value of each coding bit in the accumulated hash coding value to obtain the hash fingerprint of the target text.
5. The method of any of claims 1-4, wherein identifying candidate keywords in the target structured document comprises:
acquiring the word segmentation set corresponding to each text in the target structured document, and filtering stop words out of each word segmentation set;
performing part-of-speech tagging on the segmented words in each filtered word segmentation set;
and retaining, according to the part-of-speech tagging result, the segmented words of at least one specified part of speech in each word segmentation set as candidate keywords.
6. The method according to any one of claims 1 to 4, wherein calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document comprises:
calculating a first weight of each candidate keyword according to the word frequency-inverse text frequency index of each candidate keyword in the structured long document;
calculating a second weight of each candidate keyword according to the information entropy of each candidate keyword in the structured long document;
and calculating the weight value of each candidate keyword according to the first weight coefficient, the second weight coefficient, the first weight value and the second weight value of each candidate keyword.
7. The method according to any one of claims 1 to 4, wherein calculating a score value corresponding to each candidate keyword according to a weight value of each candidate keyword and a preset text sorting algorithm comprises:
generating a candidate keyword graph according to the co-occurrence of different candidate keywords in a preset length window;
according to the candidate keyword graph and the weight value of each candidate keyword, the following formula is adopted:
Figure FDA0003876521020000031
iteratively obtaining score values corresponding to the candidate keywords;
wherein WS (V) i ) As candidate key words V i D is a predetermined damping factor, in (V) i ) For candidate keyword V in candidate keyword graph i For each candidate keyword of the edge node, out (V) j ) For candidate keyword V in candidate keyword graph i The candidate keywords are the edge nodes; w is a group of ji Is the transition probability between the candidate keyword i and the candidate keyword j;
wherein,
Figure FDA0003876521020000032
W(V_i) is the weight value of the candidate keyword V_i.
8. An apparatus for extracting a keyword from a document, comprising:
the document structuralization conversion module is used for converting the unstructured long document to be processed into a structured long document;
the document deduplication module is used for performing text deduplication processing on the long structured document according to the hash fingerprint of each text in the long structured document to obtain a target structured document, wherein each text comprises at least one sentence;
the weight value calculation module is used for identifying each candidate keyword in the target structured document and calculating the weight value of each candidate keyword according to the word frequency-inverse text frequency index and the information entropy of each candidate keyword in the structured long document;
and the document keyword screening module is used for calculating score values corresponding to the candidate keywords according to the weight values of the candidate keywords and a preset text sorting algorithm, and screening the candidate keywords according to the score values to obtain the document keywords.
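The deduplication module described above can be illustrated with a short sketch. The claim does not name a hash function or a normalization step, so the SHA-1 digest over whitespace-collapsed, lower-cased text is an assumption, and `fingerprint` and `deduplicate` are hypothetical names.

```python
import hashlib
import re


def fingerprint(text):
    """Hash fingerprint of a text after simple normalization (assumed scheme)."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first text for each fingerprint, preserving document order."""
    seen = set()
    kept = []
    for text in texts:
        fp = fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept.append(text)
    return kept
```

An exact digest of this kind only removes texts that are identical after normalization; detecting near-duplicates would need a locality-sensitive fingerprint instead.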
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of extracting document keywords according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions for causing a processor to implement the method for extracting the document keyword according to any one of claims 1 to 7 when executed.
CN202211216904.9A 2022-09-30 2022-09-30 Document keyword extraction method and device, electronic equipment and storage medium Pending CN115577082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211216904.9A CN115577082A (en) 2022-09-30 2022-09-30 Document keyword extraction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115577082A (en) 2023-01-06

Family

ID=84583246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211216904.9A Pending CN115577082A (en) 2022-09-30 2022-09-30 Document keyword extraction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115577082A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117151051A (en) * 2023-09-18 2023-12-01 上海鸿翼软件技术股份有限公司 Document processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination