CN109902290B - Text information-based term extraction method, system and equipment - Google Patents


Info

Publication number
CN109902290B
CN109902290B
Authority
CN
China
Prior art keywords
word
text
node
words
term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910063975.1A
Other languages
Chinese (zh)
Other versions
CN109902290A (en)
Inventor
杜翠凤
沈文明
周冠宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jiesai Communication Planning And Design Institute Co ltd
GCI Science and Technology Co Ltd
Original Assignee
Guangzhou Jiesai Communication Planning And Design Institute Co ltd
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Jiesai Communication Planning And Design Institute Co ltd, GCI Science and Technology Co Ltd filed Critical Guangzhou Jiesai Communication Planning And Design Institute Co ltd
Priority to CN201910063975.1A priority Critical patent/CN109902290B/en
Publication of CN109902290A publication Critical patent/CN109902290A/en
Application granted granted Critical
Publication of CN109902290B publication Critical patent/CN109902290B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a text information-based term extraction method, which comprises the following steps: acquiring a text to be processed and preprocessing it; extracting words that satisfy the mutual information judgment index and the context-dependent judgment index from the text to be processed, and recording the words into a seed word set; constructing a seed word network from the nodes of the seed word set and the edges between the nodes; defining the weight of each node and iterating the node weights through a preset model until they converge; and sorting the nodes by weight, and when sequentially arranged seed words form adjacent phrases, extracting the adjacent phrases as candidate terms. The invention also discloses a text information-based term extraction system and device. Embodiments of the invention fully take Chinese grammatical structure into account, are automated and dynamically updatable, and meet the requirement of high-speed term extraction from modern massive texts.

Description

Text information-based term extraction method, system and equipment
Technical Field
The present invention relates to the field of language identification technologies, and in particular, to a method, a system, and an apparatus for extracting terms based on text information.
Background
Automatic term extraction has become a research hotspot in the field of natural language processing. A prior-art automatic term extraction method proceeds as follows: first, seed words of the text are extracted using mutual information and context dependence; then, words are spliced together, in combination with a word-frequency method, to form compound words of the key domain; finally, the degree of association between terms is quantitatively measured using domain consistency, domain relevance and domain membership. The seed word extraction method based on mutual information, context dependence and information entropy takes frequent words of the text as reference points and synthesizes seed words by forward or backward splicing; the extracted terms have high completeness, but because the method does not consider Chinese grammatical structure, it can produce a large number of non-domain compound words or terms. In addition, although the term extraction method based on domain consistency, domain relevance and domain membership can extract domain compound words and terms well, it is difficult to find an optimal threshold value for each index.
Disclosure of Invention
The embodiment of the invention aims to provide a text information-based term extraction method, system and equipment which fully consider Chinese grammatical structure, are automated and dynamically updatable, and meet the requirement of high-speed term extraction from modern massive texts.
In order to achieve the above object, an embodiment of the present invention provides a text information-based term extraction method, including:
acquiring a text to be processed, and preprocessing the text to be processed;
extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
constructing a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Compared with the prior art, in the text information-based term extraction method disclosed by the invention, first, on the basis of preprocessing, prospective seed words are mined using the mutual information judgment indexes and the context-dependent judgment indexes and recorded into a seed word set; then, a seed word network is constructed from the nodes of the seed word set and the edges between the nodes, and the node weights are iterated with the algorithm of a preset model until they converge; finally, the nodes are sorted by weight, and when sequentially arranged seed words form adjacent phrases, the adjacent phrases are extracted as candidate terms. The text information-based term extraction method solves the prior-art problem that ignoring Chinese grammatical structure extracts a large number of non-domain compound words or terms; it fully considers Chinese grammatical structure, is automated and dynamically updatable, and meets the requirement of high-speed term extraction from modern massive texts.
As an improvement of the above solution, after extracting the adjacent phrase as a candidate term, the method further includes:
calculating the support and the confidence of the candidate term in a database; the database comprises a plurality of words of a preset domain;
and when the candidate term belongs to the preset domain, extracting the candidate term to form a term dictionary of the preset domain.
As an improvement of the above scheme, the preprocessing the text to be processed specifically includes:
carrying out minimum unit division of words on the text to be processed by utilizing an hanlp word segmentation system; the minimum unit represents a single word which can be divided into the text to be processed under the current word segmentation system.
As an improvement of the above scheme, the mutual information judgment index satisfies the following formula:
MI(S) = (n_i / N_i) × log2[ (n_i / N_i) / (f(t_1) × f(t_2) × … × f(t_i)) ]    formula (1);
wherein the word string S = t_1 t_2 … t_i, and each t_i is a word or word combination segmented by the hanlp segmentation system; f(t_i) denotes the frequency with which t_i occurs; n_i is the number of times the word string S appears, and N_i is the number of occurrences of all words in the database, so that n_i / N_i is the relative frequency of the word string S.
As an improvement of the above-described scheme, the context-dependent determination index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) × log2 p(w|t_i)    formula (2);
wherein w denotes a specific word that appears after t_i within the particular window, and p(w|t_i) is the probability of that specific word appearing given that t_i has occurred; W is the set of all such specific words that appear when t_i appears; the particular window is a window of specific length over the text to be processed, and the window contains several words.
As an improvement of the above solution, the defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges specifically includes:
defining the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
w_ij = (v_i · v_j) / (||v_i|| × ||v_j||)    formula (3);
wherein w_ij is the semantic relevance between the words t_i and t_j, computed as the similarity of their word vectors v_i and v_j, and represents the importance of the edge connecting the two nodes;
iterating the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
WS(t_i) = (1 - d) + d × ∑_{t_j∈In(t_i)} [ w_ji / ∑_{t_k∈Out(t_j)} w_jk ] × WS(t_j)    formula (4);
wherein WS(t_i) denotes the importance of node t_i; d denotes a damping coefficient, typically less than 1; t_j ∈ In(t_i) denotes the words t_j after which the word t_i follows; t_k ∈ Out(t_j) denotes the words t_k that follow the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between the words t_j and t_k.
As an improvement of the above scheme, the extracting the adjacent phrase as a candidate term specifically includes:
extracting the adjacent phrases as candidate terms by using a sliding window.
To achieve the above object, an embodiment of the present invention further provides a text information-based term extraction system, including:
the text preprocessing unit is used for obtaining the text to be processed and preprocessing it;
the seed word set recording unit is used for extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed and recording the words into a seed word set;
a seed word network construction unit, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
the convergence unit is used for defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges;
the candidate term extraction unit is used for sequencing the weights of the nodes, and extracting adjacent phrases as candidate terms when the seed words which are sequentially arranged form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Compared with the prior art, in the text information-based term extraction system disclosed by the invention, first, on the basis of preprocessing by the text preprocessing unit, the seed word set recording unit mines prospective seed words using the mutual information judgment indexes and the context-dependent judgment indexes and records them into a seed word set; then, the seed word network construction unit constructs a seed word network from the nodes of the seed word set and the edges between the nodes, and the convergence unit iterates the node weights with the algorithm of a preset model until they converge; finally, the candidate term extraction unit sorts the nodes by weight, and when sequentially arranged seed words form adjacent phrases, extracts the adjacent phrases as candidate terms. The text information-based term extraction system disclosed by the invention fully considers Chinese grammatical structure, is automated and dynamically updatable, and meets the requirement of high-speed term extraction from modern massive texts.
As an improvement of the above solution, the system further comprises:
a support and confidence calculating unit, for calculating the support and confidence of the candidate terms in the database; the database comprises a plurality of words of a preset domain;
and the term dictionary generating unit is used for extracting the candidate terms to form a term dictionary of the preset domain when the candidate terms belong to the preset domain.
To achieve the above object, an embodiment of the present invention further provides a text information based term extraction device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the text information based term extraction method according to any one of the embodiments described above when executing the computer program.
Drawings
Fig. 1 is a flowchart of a text information-based term extraction method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a seed word network in a term extraction method based on text information according to an embodiment of the present invention;
FIG. 3 is another flow chart of a text information based term extraction method provided by an embodiment of the present invention;
FIG. 4 is a block diagram of a text information-based term extraction system 10 according to an embodiment of the present invention;
fig. 5 is a block diagram of a text information based term extracting device 20 according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flowchart of a text information-based term extraction method according to an embodiment of the present invention; comprising the following steps:
s1, acquiring a text to be processed, and preprocessing the text to be processed;
s2, extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
s3, constructing a seed word network based on the nodes of the seed word set and the edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
s4, defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
s5, sorting the weight values of the nodes, and extracting adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Specifically, in step S1, the text to be processed is unstructured text, and the unstructured text may be several words, several sentences, or an article.
Preferably, the preprocessing of the text to be processed specifically includes: carrying out minimum-unit division of words on the text to be processed by utilizing the hanlp word segmentation system; the minimum unit represents a single word into which the text to be processed can be divided under the current word segmentation system. The minimum units into which the same word is divided differ according to the dictionary used. For example, "cloud computing" may be split into "cloud / computing" under a general-purpose segmentation dictionary, and kept whole as "cloud computing" if a custom dictionary containing the compound is used. The minimum unit is thus a word that can be divided out under the current tool.
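As a concrete illustration of minimum-unit division, the following is a minimal forward-maximum-matching sketch standing in for the segmenter; forward maximum matching is a common segmentation strategy but is only an assumption here (it is not claimed to be hanlp's algorithm), and the dictionaries and the 云计算 ("cloud computing") example are illustrative.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary entry,
    falling back to a single character when nothing longer matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if length == 1 or cand in dictionary:
                words.append(cand)
                i += length
                break
    return words

# Under a general dictionary, "cloud computing" (云计算) splits in two;
# a custom dictionary containing the compound keeps it as one minimum unit.
general_dict = {"云", "计算"}
custom_dict = {"云计算"}
```

With `general_dict`, `segment("云计算", general_dict)` yields `["云", "计算"]`, while `custom_dict` keeps `["云计算"]` whole, matching the dictionary-dependent behaviour the text describes.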
Specifically, in step S2, the conventional mutual information calculation underweights the probability that the word combination itself occurs, so a probability influence coefficient for the occurrence of the word string must be considered when calculating the mutual information. The mutual information judgment index satisfies the following formula:
MI(S) = (n_i / N_i) × log2[ (n_i / N_i) / (f(t_1) × f(t_2) × … × f(t_i)) ]    formula (1);
wherein the word string S = t_1 t_2 … t_i, and each t_i is a word or word combination segmented by the hanlp segmentation system; f(t_i) denotes the frequency with which t_i occurs; n_i is the number of times the word string S appears, and N_i is the number of occurrences of all words in the database, so that n_i / N_i is the relative frequency of the word string S.
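A hedged numeric sketch of the coefficient-weighted mutual information described above: standard pointwise mutual information of the word string, scaled by the string's relative frequency n_i / N_i. The exact functional form in the patent figure is not legible, so this composition is an assumption based on the variable definitions.

```python
import math

def mutual_information(string_count, part_counts, total_words):
    """(n_i / N_i) * log2( p(S) / prod_i p(t_i) ): pointwise mutual
    information of the word string, weighted by the probability influence
    coefficient n_i / N_i (the string's relative frequency)."""
    p_s = string_count / total_words           # relative frequency of S
    prod = 1.0
    for count in part_counts:                  # independent-occurrence baseline
        prod *= count / total_words
    return (string_count / total_words) * math.log2(p_s / prod)
```

A string that co-occurs far more often than its parts would by chance scores higher, so frequent, tightly bound combinations pass the threshold.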
Context dependence refers to the conditional entropy of the context words of t_i within a particular window, given that t_i has occurred; the context-dependent judgment index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) × log2 p(w|t_i)    formula (2);
wherein w denotes a specific word that appears after t_i within the particular window, and p(w|t_i) is the probability of that specific word appearing given that t_i has occurred; W is the set of all such specific words that appear when t_i appears; the particular window is a window of specific length over the text to be processed, and the window contains several words. The benefit of setting the particular window is that it largely precludes misjudging certain incidental word combinations as terms.
For example, suppose the text in the particular window is "a segment of program"; then with t_i being "a segment", p(w|t_i) is the probability that the word "program" appears after "a segment" has appeared, and w is the specific word "program" reappearing when "a segment" appears within the window. Over the whole corpus, after "a segment" appears, specific words such as "program", "road surface", "telephone" or "ribbon" may follow, none of which includes "a segment" itself; the collection of all these specific words is the set of words that appear when a certain word has appeared, i.e. the set of specific conditional states.
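The conditional entropy of formula (2) can be sketched directly from follower counts; the toy follower lists below mirror the "a segment" example and are purely illustrative.

```python
import math
from collections import Counter

def context_entropy(followers):
    """H(W | t_i): entropy of the words observed after t_i within the window.
    `followers` is the list of context words collected from the corpus."""
    counts = Counter(followers)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A word followed by many different words (high entropy) behaves like a free-standing unit, while one locked to a single follower (entropy 0) is likely part of a longer term.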
Specifically, a threshold value of mutual information and context dependence is set according to the corpus, and if the word or word combination meets the threshold value, the word or word combination is recorded into the seed word set.
Specifically, in step S3, referring to fig. 2, fig. 2 is a schematic diagram of a seed word network in a term extraction method based on text information according to an embodiment of the present invention. The nodes V of the seed word set and the edges E between the nodes form a seed word network G = (V, E), wherein a node is any seed word in the seed word set (e.g. "algorithm" in fig. 2), and the edges of a node connect it to the seed words adjacent to the current node (e.g. the edges of "algorithm" in fig. 2 connect to "unsupervised", "neural network" and "intelligent"); the initial weight of each edge is 1 or any equal constant.
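Step S3 can be sketched as follows, linking seed words that occur adjacently in the segmented token stream; the token list, seed set and function names are invented for illustration.

```python
from collections import defaultdict

def build_seed_network(tokens, seed_words):
    """Undirected seed-word network G = (V, E): nodes are seed words,
    edges connect seed words that are adjacent in the text."""
    graph = defaultdict(set)
    prev = None
    for tok in tokens:
        if tok in seed_words:
            _ = graph[tok]               # ensure the node exists even if isolated
            if prev is not None and prev != tok:
                graph[prev].add(tok)     # undirected edge: add both directions
                graph[tok].add(prev)
            prev = tok
        else:
            prev = None                  # a non-seed word breaks adjacency
    return graph
```

Mirroring fig. 2, "algorithm" ends up linked to every seed word it was adjacent to anywhere in the text.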
Specifically, in step S4: the mutual information and context-dependence measures of the above steps are overly statistical indexes for measuring word features, and do not reflect the semantic features between words at the semantic level.
To address this problem, the embodiment of the invention first defines the node weights using semantic relevance. Node semantic relevance refers to the probability of seed words occurring together, which accords with the assumption of the embedding method, namely that related words have similar contexts; whether seed words belong to the same category is judged by quantitatively measuring the semantic hierarchical relations among them. Word vectors trained with a text-corpus-based embedding method carry semantic correlation, so on the basis of word2vec training preprocessing over each corpus, the similarity between vectors is used to reflect the semantic relevance feature; wherein the semantic relevance satisfies the following formula:
w_ij = (v_i · v_j) / (||v_i|| × ||v_j||)    formula (3);
wherein w_ij is the semantic relevance between the words t_i and t_j, computed as the similarity of their word vectors v_i and v_j, and represents the importance of the edge connecting the two nodes;
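The vector similarity of formula (3) is sketched below as cosine similarity between word vectors; taking cosine as the similarity measure is an assumption consistent with the word2vec setting described above, since the original figure is not reproduced, and the vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two word vectors: the assumed edge weight w_ij."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Identical vectors give 1.0 and orthogonal vectors give 0.0, so edges between semantically close seed words carry more weight in the iteration that follows.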
then iterating the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
WS(t_i) = (1 - d) + d × ∑_{t_j∈In(t_i)} [ w_ji / ∑_{t_k∈Out(t_j)} w_jk ] × WS(t_j)    formula (4);
wherein WS(t_i) denotes the importance of node t_i; d denotes a damping coefficient, typically less than 1; t_j ∈ In(t_i) denotes the words t_j after which the word t_i follows; t_k ∈ Out(t_j) denotes the words t_k that follow the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between the words t_j and t_k. The iteration continues according to the word ordering of the corpus until the stopping condition is met.
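The iteration of formula (4) can be sketched as a weighted TextRank over the seed-word graph; the graph shape, the uniform weights and the damping factor 0.85 are illustrative assumptions.

```python
def textrank(graph, weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate WS(t_i) = (1 - d) + d * sum_j [ w_ji / sum_k w_jk ] * WS(t_j)
    until the largest per-node change falls below `tol`.
    `graph` maps each node to its neighbours (undirected, so In == Out);
    `weights[(j, i)]` holds the semantic relevance w_ji."""
    ws = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new = {}
        for i in graph:
            acc = 0.0
            for j in graph[i]:
                denom = sum(weights[(j, k)] for k in graph[j])
                if denom:
                    acc += weights[(j, i)] / denom * ws[j]
            new[i] = (1 - d) + d * acc
        converged = max(abs(new[n] - ws[n]) for n in graph) < tol
        ws = new
        if converged:
            break
    return ws
```

On a symmetric graph with equal weights every node converges to the same score, as expected; asymmetric connectivity or weights shift importance toward well-connected seed words.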
Specifically, in step S5, the nodes are ranked by weight and the Top-N seed words are obtained; if adjacent phrases are formed among the Top-N seed words, the adjacent phrases are extracted as terms. This reflects, at the semantic level, the semantic features among the words constituting a term, and can reduce the interference of irrelevant word combinations to a certain extent.
Preferably, the adjacent phrases are extracted as candidate terms by using a sliding window. For example, take the sentence "set the mutual information and context-dependent thresholds according to the corpus; if the word or word combination meets the thresholds, it is included in the seed node set", and suppose the extracted Top-N seed words are: "corpus", "set", "mutual information", "context", "dependent", "threshold", "word combination", "seed node", "collection". A window of length 6 slides across the sentence from left to right; whenever the words framed by the window are adjacent members of the Top-N seed word set, they are combined into candidate terms (e.g. "seed node collection", "context dependent"); otherwise the single seed word itself is taken into the candidate term set (e.g. "corpus", "mutual information").
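The sliding-window extraction can be sketched as below. Keeping only runs of two or more Top-N seed words as multi-word candidates is a simplification of the description above (single seed words could be retained separately), and the token list is invented.

```python
def candidates_by_window(tokens, top_seeds, window=6):
    """Slide a fixed-length window over the tokens; within each window,
    maximal runs of adjacent Top-N seed words become candidate terms."""
    window = min(window, len(tokens))
    found = []
    for start in range(len(tokens) - window + 1):
        run = []
        for tok in tokens[start:start + window]:
            if tok in top_seeds:
                run.append(tok)
            else:
                if len(run) > 1:         # a run of 2+ seed words is a candidate
                    found.append(tuple(run))
                run = []
        if len(run) > 1:                 # flush a run ending at the window edge
            found.append(tuple(run))
    seen, out = set(), []                # deduplicate, keep first-seen order
    for cand in found:
        if cand not in seen:
            seen.add(cand)
            out.append(cand)
    return out
```

Candidates produced this way would then be checked against the preset term rules of Table 1 before being accepted.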
Preferably, the preset term rule is the Chinese term rule shown in Table 1, wherein the determiner phrase includes: adjectives, distinguishing words, verbs, nouns, and numeral + adjective combinations.
TABLE 1 Chinese term rules

Part-of-speech phrase    Template
baseNP                   verb + noun
baseNP                   baseNP + baseNP
baseNP                   baseNP + noun
baseNP                   determiner phrase + baseNP
baseNP                   determiner phrase + noun
Further, after the candidate terms are extracted, the method further includes step S6: calculating the support and the confidence of the candidate terms in a database, the database comprising a plurality of words of a preset domain; and, when a candidate term belongs to the preset domain, extracting the candidate term to form a term dictionary of the preset domain.
The support reveals the probability that the terms m_i and m_j occur simultaneously, and is expressed as:
Support(m_i -> m_j) = P(m_i ∪ m_j)    formula (5);
The confidence reveals whether, or with what probability, the term m_j appears after the term m_i has appeared:
Confidence(m_i -> m_j) = P(m_j | m_i)    formula (6);
The support and confidence of each candidate term in the specific domain are calculated through formula (5) and formula (6) and compared with the set minimum support and minimum confidence; candidate terms whose support is below the minimum support or whose confidence is below the minimum confidence are excluded, and the remaining terms finally form the Chinese term dictionary of the specific domain.
Association rule acquisition mainly uses data mining to find, in a large database of event records, the frequent patterns that satisfy a minimum support Minsup and a minimum confidence Minconf. After the candidate terms are found, the embodiment of the invention calculates the support and confidence of each candidate term in the preset domain, compares them with the minimum support and minimum confidence of terms in that domain, excludes the large number of non-domain candidate terms, and finally forms the Chinese dictionary of the preset domain. The preset domain may be any specific domain; terms in different domains have different confidence and support values, which the present invention does not specifically limit.
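A minimal sketch of formulas (5) and (6) over a toy transaction database. Note that the confidence of m_i -> m_j is computed here as P(m_j | m_i), i.e. pair support divided by the support of m_i, following the association-rule reading of the text; all data and names are invented.

```python
def support(db, items):
    """Support: fraction of transactions containing every item in `items`."""
    hits = sum(1 for transaction in db if all(i in transaction for i in items))
    return hits / len(db)

def confidence(db, m_i, m_j):
    """Confidence P(m_j | m_i): how often m_j appears given m_i appeared."""
    s_i = support(db, [m_i])
    return support(db, [m_i, m_j]) / s_i if s_i else 0.0

def filter_terms(db, pairs, minsup, minconf):
    """Keep only candidate pairs meeting the minimum support and confidence."""
    return [(a, b) for a, b in pairs
            if support(db, [a, b]) >= minsup and confidence(db, a, b) >= minconf]
```

Candidate pairs below either threshold are dropped, leaving the domain term dictionary.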
Further, the process of steps S1 to S6 may refer to fig. 3.
When the method is implemented, first, on the basis of preprocessing, prospective seed words are mined using the mutual information judgment indexes and the context-dependent judgment indexes and recorded into a seed word set; then, a seed word network is constructed from the nodes of the seed word set and the edges between the nodes, and the node weights are iterated with the algorithm of a preset model until they converge; finally, the nodes are sorted by weight, and when sequentially arranged seed words form adjacent phrases, the adjacent phrases are extracted as candidate terms.
Compared with the prior art, the text information-based term extraction method disclosed by the invention solves the problem that a large number of non-domain compound words or terms are extracted due to no consideration of Chinese grammar levels in the prior art, can fully consider the problem of Chinese grammar levels, has the characteristics of automation and dynamic update, and meets the requirement of high-speed extraction of modern massive text terms.
Example two
Referring to fig. 4, fig. 4 is a block diagram illustrating a text-based term extraction system 10 according to an embodiment of the present invention; comprising the following steps:
a text preprocessing unit 1, configured to obtain the text to be processed and preprocess it;
a seed word set recording unit 2, configured to extract words satisfying the mutual information judgment index and the context dependent judgment index from the text to be processed, and record the words into a seed word set;
a seed word network construction unit 3, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
a convergence unit 4, configured to define a weight of the node, and iterate the weight of the node through a preset model until the weight of the node converges;
a candidate term extraction unit 5, configured to sort weights of the nodes, and extract adjacent phrases as candidate terms when the seed words arranged in sequence form the adjacent phrases; wherein the adjacent phrases meet a preset term rule.
Preferably, the text-based term extracting system 10 further includes:
a support and confidence calculation unit 6, configured to calculate the support and confidence of the candidate terms in the database, the database comprising a plurality of words of a preset domain;
a term dictionary generating unit 7, configured to extract the candidate terms to form a term dictionary of the preset domain when the candidate terms belong to the preset domain.
Preferably, the text preprocessing unit 1 performs minimum-unit division of words on the text to be processed by using the hanlp word segmentation system; the minimum unit represents a single word into which the text to be processed can be divided under the current word segmentation system.
Preferably, the mutual information judgment index satisfies the following formula:
MI(S) = (n_i / N_i) × log2[ (n_i / N_i) / (f(t_1) × f(t_2) × … × f(t_i)) ]    formula (1);
wherein the word string S = t_1 t_2 … t_i, and each t_i is a word or word combination segmented by the hanlp segmentation system; f(t_i) denotes the frequency with which t_i occurs; n_i is the number of times the word string S appears, and N_i is the number of occurrences of all words in the database, so that n_i / N_i is the relative frequency of the word string S.
Preferably, the context dependent decision index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) × log2 p(w|t_i)    formula (2);
wherein w denotes a specific word that appears after t_i within the particular window, and p(w|t_i) is the probability of that specific word appearing given that t_i has occurred; W is the set of all such specific words that appear when t_i appears; the particular window is a window of specific length over the text to be processed, and the window contains several words.
Preferably, the convergence unit 4 defines the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
w_ij = (v_i · v_j) / (||v_i|| × ||v_j||)    formula (3);
wherein w_ij is the semantic relevance between the words t_i and t_j, computed as the similarity of their word vectors v_i and v_j, and represents the importance of the edge connecting the two nodes;
the convergence unit 4 iterates the weight of the node through the Textrank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
WS(t_i) = (1 - d) + d × ∑_{t_j∈In(t_i)} [ w_ji / ∑_{t_k∈Out(t_j)} w_jk ] × WS(t_j)    formula (4);
wherein WS(t_i) denotes the importance of node t_i; d denotes a damping coefficient, typically less than 1; t_j ∈ In(t_i) denotes the words t_j after which the word t_i follows; t_k ∈ Out(t_j) denotes the words t_k that follow the word t_j; WS(t_j) denotes the importance of node t_j; and w_jk is the semantic relevance between the words t_j and t_k.
Preferably, the candidate term extraction unit 5 extracts the adjacent phrase as a candidate term using a sliding window.
For the working process of each unit, refer to steps S1 to S6 in the above embodiment, which will not be repeated here.
When the system is implemented, first, on the basis of preprocessing by the text preprocessing unit 1, the seed word set recording unit 2 mines prospective seed words using the mutual information judgment indexes and the context-dependent judgment indexes and records them into a seed word set; then, the seed word network construction unit 3 constructs a seed word network from the nodes of the seed word set and the edges between the nodes, and the convergence unit 4 iterates the node weights with the algorithm of a preset model until they converge; finally, the candidate term extraction unit 5 sorts the nodes by weight, and when sequentially arranged seed words form adjacent phrases, extracts the adjacent phrases as candidate terms.
Compared with the prior art, the text information-based term extraction system 10 disclosed by the invention solves the problem that a large number of non-domain compound words or terms are extracted due to the fact that the Chinese grammar level is not considered in the prior art, and the text information-based term extraction system 10 disclosed by the invention can fully consider the problem of the Chinese grammar level, has the characteristics of automation and dynamic update, and meets the requirement of high-speed extraction of modern massive text terms.
Example III
Referring to fig. 5, fig. 5 is a schematic structural diagram of a text information based term extracting apparatus 20 according to an embodiment of the present invention. The text information based term extracting device 20 of this embodiment includes: a processor 21, a memory 22 and a computer program stored in said memory 22 and executable on said processor 21. The processor 21, when executing the computer program, implements the steps of the above-described respective text information based term extraction method embodiments, such as steps S1 to S5 shown in fig. 1. Alternatively, the processor 21 may implement the functions of the modules/units in the above-described device embodiments when executing the computer program.
Illustratively, the computer program may be partitioned into one or more modules/units, which are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments describing the execution of the computer program in the text information-based term extracting device 20. For example, the computer program may be divided into a to-be-processed text preprocessing unit 1, a seed word set recording unit 2, a seed word network construction unit 3, a convergence unit 4, a candidate term extraction unit 5, a support and confidence calculating unit 6, and a term dictionary generating unit 7; for the specific functions of each unit, refer to the corresponding units in the text information-based term extraction system 10 described in the second embodiment above, which are not repeated here.
The text information-based term extraction device 20 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The text information-based term extracting device 20 may include, but is not limited to, the processor 21 and the memory 22. Those skilled in the art will appreciate that the schematic diagram is merely an example of the text information-based term extraction device 20 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the text information-based term extraction device 20 may also include input and output devices, network access devices, buses, and the like.
The processor 21 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general purpose processor may be a microprocessor, or the processor may be any conventional processor; it is the control center of the text information-based term extracting device 20, connecting the various parts of the entire device using various interfaces and lines.
The memory 22 may be used to store the computer program and/or module, and the processor 21 implements the various functions of the text information-based term extracting device 20 by running or executing the computer program and/or module stored in the memory 22 and invoking data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The modules/units integrated by the text information-based term extracting device 20 may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as a separate product. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment through a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by the processor 21, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately adjusted according to the requirements of legislation and patent practice in each jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relations between modules indicate that they have communication connections, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims (5)

1. A text information-based term extraction method, comprising:
acquiring a text to be processed, and preprocessing the text to be processed;
extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed, and recording the words into a seed word set;
constructing a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
defining the weight of the node, and iterating the weight of the node through a preset model until the weight of the node converges;
sorting the weight values of the nodes, and extracting an adjacent phrase as a candidate term when the seed words arranged in sequence form the adjacent phrase; wherein the adjacent phrase meets a preset term rule;
calculating the support and the confidence of the candidate term in a database; wherein the database comprises a plurality of words in a preset domain; the support reveals the probability that the terms m_i and m_j occur simultaneously, expressed as: support(m_i -> m_j) = P(m_i ∪ m_j); the confidence reveals whether, or how likely it is that, the term m_j will appear after the term m_i appears, with the formula: confidence(m_i -> m_j) = P(m_j | m_i);
Extracting the candidate terms to form a term dictionary of a preset domain when the candidate terms belong to the preset domain;
the preprocessing the text to be processed specifically comprises the following steps:
dividing the text to be processed into minimum word units by using the hanlp word segmentation system; wherein a minimum unit represents a single word into which the text to be processed can be divided under the current word segmentation system;
the mutual information judgment index satisfies the following formula:
Figure FDA0004074546040000011
wherein the word string S = t_1 t_2 … t_i, and t_i is a word or a word combination segmented by the hanlp segmentation system; f(t_i) represents the frequency of occurrence of t_i; n_i is the number of times the word string S appears, and N_i is the number of occurrences of all words in the database;
the context dependent decision index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) * log2 p(w|t_i)    Formula (2);
wherein w denotes a particular word that appears again within a particular window given that t_i has occurred, and p(w|t_i) is the probability of that word appearing; W denotes the set of all such particular words that appear when t_i occurs; the particular window is a window of specific length over the text to be processed, and the window of specific length contains a plurality of words.
2. The text information based term extraction method of claim 1, wherein the defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges specifically includes:
defining the weight of the node by adopting semantic relevance; wherein the semantic relevance satisfies the following formula:
Figure FDA0004074546040000021
wherein w_ij is the semantic relevance between the words t_i and t_j and represents the importance of the edge connecting the two nodes;
iterating the weight of the node through the TextRank model until the weight of the node converges; wherein the iterative process satisfies the following formula:
Figure FDA0004074546040000022
wherein WS(t_i) represents the importance of node t_i; d represents a damping coefficient, typically less than 1; t_j ∈ In(t_i) denotes that the word t_i follows the word t_j; t_k ∈ Out(t_j) denotes that the word t_k follows the word t_j; WS(t_j) represents the importance of node t_j; w_jk is the semantic relatedness between the words t_j and t_k.
3. The text information based term extraction method of claim 1, wherein the extracting the adjacent phrase as a candidate term specifically includes:
and extracting the adjacent phrases by using the sliding window as candidate terms.
4. A text-based term extraction system, comprising:
a text preprocessing unit, used for acquiring a text to be processed and preprocessing the text to be processed;
the seed word set recording unit is used for extracting words meeting mutual information judgment indexes and context-dependent judgment indexes from the text to be processed and recording the words into a seed word set;
a seed word network construction unit, configured to construct a seed word network based on nodes of the seed word set and edges of the nodes; the node is any seed word in the seed word set, and the edge of the node is the seed word adjacent to the current node;
the convergence unit is used for defining the weight of the node and iterating the weight of the node through a preset model until the weight of the node converges;
a candidate term extraction unit, used for sorting the weights of the nodes and extracting an adjacent phrase as a candidate term when the seed words arranged in sequence form the adjacent phrase; wherein the adjacent phrase meets a preset term rule;
a support and confidence calculating unit, for calculating the support and the confidence of the candidate term in a database; wherein the database comprises a plurality of words in a preset domain; the support reveals the probability that the terms m_i and m_j occur simultaneously, expressed as: support(m_i -> m_j) = P(m_i ∪ m_j); the confidence reveals whether, or how likely it is that, the term m_j will appear after the term m_i appears, with the formula: confidence(m_i -> m_j) = P(m_j | m_i);
A term dictionary generating unit configured to extract a term dictionary of a preset domain constituted by the candidate terms when the candidate terms belong to the preset domain;
wherein the text preprocessing unit is specifically configured to:
dividing the text to be processed into minimum word units by using the hanlp word segmentation system; wherein a minimum unit represents a single word into which the text to be processed can be divided under the current word segmentation system;
the mutual information judgment index satisfies the following formula:
Figure FDA0004074546040000041
wherein the word string S = t_1 t_2 … t_i, and t_i is a word or a word combination segmented by the hanlp segmentation system; f(t_i) represents the frequency of occurrence of t_i; n_i is the number of times the word string S appears, and N_i is the number of occurrences of all words in the database;
the context dependent decision index satisfies the following formula:
H(W|t_i) = -∑_{w∈W} p(w|t_i) * log2 p(w|t_i)    Formula (2);
wherein w denotes a particular word that appears again within a particular window given that t_i has occurred, and p(w|t_i) is the probability of that word appearing; W denotes the set of all such particular words that appear when t_i occurs; the particular window is a window of specific length over the text to be processed, and the window of specific length contains a plurality of words.
5. A text information based term extracting device, characterized by comprising a processor, a memory and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the text information based term extracting method according to any one of claims 1 to 3 when executing the computer program.
CN201910063975.1A 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment Active CN109902290B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910063975.1A CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910063975.1A CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Publications (2)

Publication Number Publication Date
CN109902290A CN109902290A (en) 2019-06-18
CN109902290B true CN109902290B (en) 2023-06-30

Family

ID=66944048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910063975.1A Active CN109902290B (en) 2019-01-23 2019-01-23 Text information-based term extraction method, system and equipment

Country Status (1)

Country Link
CN (1) CN109902290B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021189291A1 (en) * 2020-03-25 2021-09-30 Metis Ip (Suzhou) Llc Methods and systems for extracting self-created terms in professional area
CN111680128A (en) * 2020-06-16 2020-09-18 杭州安恒信息技术股份有限公司 Method and system for detecting web page sensitive words and related devices
CN112966508B (en) * 2021-04-05 2023-08-25 集智学园(北京)科技有限公司 Universal automatic term extraction method
CN115130472B (en) * 2022-08-31 2023-02-21 北京澜舟科技有限公司 Method, system and readable storage medium for segmenting subwords based on BPE
CN116756298B (en) * 2023-08-18 2023-10-20 太仓市律点信息技术有限公司 Cloud database-oriented AI session information optimization method and big data optimization server

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN107329950A (en) * 2017-06-13 2017-11-07 武汉工程大学 It is a kind of based on the Chinese address segmenting method without dictionary
CN108287825A (en) * 2018-01-05 2018-07-17 中译语通科技股份有限公司 A kind of term identification abstracting method and system
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 A kind of keyword extracting method for admiring class

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship
CN107329950A (en) * 2017-06-13 2017-11-07 武汉工程大学 It is a kind of based on the Chinese address segmenting method without dictionary
CN108287825A (en) * 2018-01-05 2018-07-17 中译语通科技股份有限公司 A kind of term identification abstracting method and system
CN108549626A (en) * 2018-03-02 2018-09-18 广东技术师范学院 A kind of keyword extracting method for admiring class

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Keyword extraction method based on contextual relations and the TextRank algorithm; Du Haizhou et al.; Journal of Shanghai University of Electric Power; 2017-12-30; pp. 607-612 *
Research on ontology concept extraction based on association rules and semantic rules; He Haitao et al.; Journal of Jilin University (Information Science Edition); 2014-11-30; pp. 657-663 *

Also Published As

Publication number Publication date
CN109902290A (en) 2019-06-18

Similar Documents

Publication Publication Date Title
CN109902290B (en) Text information-based term extraction method, system and equipment
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN112347778A (en) Keyword extraction method and device, terminal equipment and storage medium
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN111177375B (en) Electronic document classification method and device
CN112434134B (en) Search model training method, device, terminal equipment and storage medium
CN110705247A (en) Based on x2-C text similarity calculation method
CN113590811B (en) Text abstract generation method and device, electronic equipment and storage medium
CN115062621A (en) Label extraction method and device, electronic equipment and storage medium
CN111046662B (en) Training method, device and system of word segmentation model and storage medium
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
CN112836491B (en) NLP-oriented Mashup service spectrum clustering method based on GSDPMM and topic model
CN113157857B (en) Hot topic detection method, device and equipment for news
CN110472140B (en) Object word recommendation method and device and electronic equipment
CN112948561A (en) Method and device for automatically expanding question-answer knowledge base
CN112948570A (en) Unsupervised automatic domain knowledge map construction system
CN111813934B (en) Multi-source text topic model clustering method based on DMA model and feature division
CN117573956B (en) Metadata management method, device, equipment and storage medium
CN117725555B (en) Multi-source knowledge tree association fusion method and device, electronic equipment and storage medium
CN111159393B (en) Text generation method for abstract extraction based on LDA and D2V
CN111125350B (en) Method and device for generating LDA topic model based on bilingual parallel corpus
Liang et al. Learning mention and relation representation with convolutional neural networks for relation extraction
CN116502637A (en) Text keyword extraction method combining context semantics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant