CN111814474B - Domain phrase mining method and device

Domain phrase mining method and device

Info

Publication number
CN111814474B
CN111814474B
Authority
CN
China
Prior art keywords
phrase
phrases
domain
word
field
Prior art date
Legal status
Active
Application number
CN202010957899.1A
Other languages
Chinese (zh)
Other versions
CN111814474A (en)
Inventor
辛秉哲
周源
Current Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd filed Critical Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202010957899.1A priority Critical patent/CN111814474B/en
Publication of CN111814474A publication Critical patent/CN111814474A/en
Application granted granted Critical
Publication of CN111814474B publication Critical patent/CN111814474B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00  Handling natural language data
    • G06F 40/20  Natural language analysis
    • G06F 40/279  Recognition of textual entities
    • G06F 40/289  Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a domain phrase mining method and device. The method includes: extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary; traversing the sample sentences based on the vocabulary to generate bag-of-words features; inputting the bag-of-words features and the domain labels into a ranking model, which ranks the importance of the features in the bag-of-words features and outputs the features whose importance is greater than a threshold as important phrases of the domain; extending the important phrases through sound and shape variation to generate an extended phrase set; and searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain the domain phrase set. The domain phrase mining method and device provided by the disclosure effectively address the low mining efficiency, small number of mined phrases, and low accuracy of prior-art domain phrase mining methods.

Description

Domain phrase mining method and device
Technical Field
The disclosure relates to the field of computer and Internet technology, and in particular to a domain phrase mining method and device, an electronic device, and a computer-readable medium.
Background
In natural language processing services, content must be assigned to domains and content in certain domains (for example, the political domain) must be recalled to ensure content security. Because of the diversity of web language, domain phrases need to be mined as accurately as possible and then applied to the domain identification of content, so as to improve the recall rate.
Prior-art domain phrase mining methods include unsupervised and supervised approaches. Phrases mined by existing unsupervised methods are not necessarily domain phrases and require further screening, so mining efficiency is low; existing supervised methods mine only a small number of domain phrases with low accuracy. It is therefore necessary to provide a domain phrase mining method that is efficient, mines a large number of phrases, and is accurate.
Disclosure of Invention
In view of this, the present disclosure provides a domain phrase mining method and device, which can effectively solve the problems of low mining efficiency, small number of mined phrases, and low accuracy in prior-art domain phrase mining methods.
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to a first aspect of the present disclosure, a domain phrase mining method is provided, including:
extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary;
traversing the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features;
inputting the bag-of-words features and the domain labels into a ranking model, ranking the importance of the features in the bag-of-words features with the ranking model, and outputting the features whose importance is greater than a threshold as important phrases of the domain;
extending the important phrases through sound and shape variation to generate an extended phrase set; and
searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
In some embodiments, the domain phrase mining method further comprises:
performing word segmentation on the sample sentences, and obtaining new words based on the segmentation; and
combining the new words with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
Further, obtaining new words based on the segmentation includes obtaining the new words through an unsupervised method.
In some embodiments, the feature length N of the N-gram features is 2 to 4.
In some embodiments, the domain phrase mining method further comprises:
merging the important phrases with existing domain phrases to obtain an initial phrase set, and extending the initial phrase set through sound and shape variation to generate the extended phrase set.
In some embodiments, searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set specifically includes:
performing word segmentation and character segmentation on the sample sentences, and generating corresponding word vectors and character vectors;
vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and
calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
In some embodiments, the domain phrase mining method further comprises:
calculating the frequency with which any phrase pair in the domain phrase set appears in the sample sentences of the domain, selecting the phrase pairs whose frequency exceeds a preset value, and using the selected phrase pairs to determine whether new corpus text belongs to the domain.
According to a second aspect of the present disclosure, there is provided a domain phrase mining apparatus, including:
a vocabulary construction unit, configured to extract N-gram features from sample sentences carrying domain labels and to select the N-gram features whose frequency is greater than a preset value as a vocabulary;
a bag-of-words feature generation unit, configured to traverse the sample sentences based on the vocabulary and to generate bag-of-words features comprising the features and word frequency vectors of the features;
a ranking unit, configured to receive the bag-of-words features and the domain labels, rank the importance of the features in the bag-of-words features, and output the features whose importance is greater than a threshold as important phrases of the domain;
an extension unit, configured to extend the important phrases through sound and shape variation to generate an extended phrase set; and
a neighbor search unit, configured to search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and to add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as provided by the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method as provided by the first aspect of the present disclosure.
The present disclosure mines domain phrases with N-gram feature extraction and a ranking model, and then extends the mined phrases and searches for their neighbors, thereby effectively solving the low mining efficiency, small number of mined phrases, and low accuracy of existing domain phrase mining.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort. The foregoing and other objects, features, and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
FIG. 1 is a flowchart of a domain phrase mining method provided according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a domain phrase mining device according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Relational terms such as "first" and "second" are used herein only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone.
Exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.
It is to be understood that the disclosure, described below with reference to the drawings, is not limited to the described embodiments. Where feasible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
Prior-art phrase mining methods fall mainly into two categories. One is unsupervised mining, which discovers new words with measures such as mutual information and degree of freedom, and expands seed keywords by clustering, near-synonym expansion, keyword variation, and similar techniques to obtain new words. The other is supervised mining, which scores phrase weights with statistics such as f-ngram-idf or TextRank, selects the phrases with larger weights as a candidate set of newly discovered phrases, and then discovers new phrases with a classification or sequence labeling model.
To address these problems, the present disclosure mines domain phrases with N-gram feature extraction and a feature-importance ranking model, and extends the mined phrases and searches for their neighbors, thereby effectively improving the efficiency of domain phrase mining while ensuring the number and accuracy of the mined phrases.
A domain phrase mining method provided in the embodiments of the present disclosure is first described in detail below.
FIG. 1 illustrates a flow diagram of a domain phrase mining method 100 provided in accordance with an embodiment of the present disclosure. The method specifically comprises the following steps:
step 110: and extracting N-gram characteristics of the sample sentences with the field labels, and selecting the N-gram characteristics with the frequency greater than a preset value as a word list.
Here, a domain label indicates whether a sample sentence belongs to a certain domain. For example, in the embodiments of the present disclosure the domain label may be a binary label: when the sample sentence belongs to the domain, the label is set to "1"; when it does not, the label is set to "0". It should be noted that the binary label is only an example; the present disclosure does not limit the form of the domain label.
N-gram is an algorithm based on a statistical language model: a window of size N is slid over the characters of a text, yielding N-gram features of length N. Extracting N-gram features from a sample sentence means feeding the sentence into the N-gram model to form features of length N. For example, when 2-gram features are extracted from the sentence "Xiao Ming takes the bus to school", the resulting 2-grams are the overlapping two-character windows of the sentence, such as "Xiao Ming", "Ming takes", "takes the", ..., "to school".
In the embodiments of the present disclosure, sample sentences carrying domain labels may be provided as shown in Table 1. N-gram features are extracted from the sample sentences in Table 1, and the N-gram features whose frequency is greater than a preset value are then selected as the vocabulary.
TABLE 1 Sample sentences with domain labels
Sample sentence | Domain label (1 = entertainment, 0 = non-entertainment)
Chinese television actor | 1
The American actor Li Xiaosi attends the awards ceremony | 1
The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu | 1
The Chinese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The American sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The German sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The Japanese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
In some cases, the feature length N of the N-gram features can be chosen as 2 to 4. N-gram features of length 2 to 4 are extracted from every sample sentence in Table 1, all extracted N-gram features are pooled, and the frequency of each N-gram feature, i.e. its number of occurrences, is counted. The N-gram features whose frequency is greater than a preset value are selected as the vocabulary. In the embodiments of the present disclosure the preset value may be 5, in which case the vocabulary formed by the N-gram features with frequency greater than 5 is { "sports", "actor", "opening", "ceremony", "attend", "awards", "Olympic Games", "China", "United States", "Germany" }.
It should be noted that the sample sentences and domain labels provided in the embodiments of the disclosure are only examples; those skilled in the art may choose other sample sentences and domain labels as needed, and the disclosure places no limit on this. Those skilled in the art may also extract N-gram features of other lengths from the sample sentences; the disclosure does not limit the length of the N-gram features.
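By way of illustration only, step 110 can be sketched in Python roughly as follows; the helper names and the choices n_range=(2, 3, 4) and min_freq=5 mirror the example above and are assumptions made for this sketch, not limitations of the disclosure. The returned list plays the role of the vocabulary in the following steps, and the new words obtained below can simply be appended to it.

    from collections import Counter

    def char_ngrams(sentence, n):
        # Slide a window of size n over the characters of the sentence.
        return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

    def build_vocabulary(sentences, n_range=(2, 3, 4), min_freq=5):
        # Count every N-gram over the whole sample corpus and keep those whose
        # frequency (number of occurrences) is greater than the preset value.
        counts = Counter()
        for sent in sentences:
            for n in n_range:
                counts.update(char_ngrams(sent, n))
        return [gram for gram, freq in counts.items() if freq > min_freq]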
In some cases, the sample sentences can be segmented into words, new words can be obtained from the segmentation through an unsupervised method, and the new words can be combined with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary. In the embodiments of the present disclosure, the unsupervised method may be mutual information calculation or clustering, which the present disclosure does not specifically limit.
For example, the sample sentence "The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu" is segmented, and the resulting segmentation includes "Wang Xiaowu". Suppose the unsupervised method applied to the segmentation yields the two new words "Wang Xiaowu" and "Li Xiaosi". These two new words can then be combined with the selected N-gram features whose frequency is greater than 5 to form the vocabulary { "sports", "actor", "opening", "ceremony", "attend", "awards", "Olympic Games", "China", "United States", "Germany", "Wang Xiaowu", "Li Xiaosi" }.
The above is only an example. In the embodiments of the present disclosure, all sample sentences are segmented, the segmentation results are merged and de-duplicated to obtain a segment set, new words are obtained from each segment in the set through the unsupervised method, and all the obtained new words are combined with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
In the embodiments of the disclosure, the new words obtained from the segmentation of the sample sentences effectively supplement the vocabulary formed by the N-gram features, and avoid missing low-frequency domain features that would be dropped if only the N-gram features with frequency greater than the preset value were selected.
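A minimal sketch of the mutual-information variant of the unsupervised new-word discovery mentioned above is given below; restricting candidates to two-character strings and the thresholds min_count and min_pmi are illustrative assumptions only, not requirements of the disclosure.

    import math
    from collections import Counter

    def pmi_new_words(sentences, min_count=5, min_pmi=3.0):
        # Return two-character candidates whose pointwise mutual information is high,
        # i.e. whose characters co-occur far more often than chance would predict.
        char_counts = Counter()
        pair_counts = Counter()
        for sent in sentences:
            char_counts.update(sent)
            pair_counts.update(sent[i:i + 2] for i in range(len(sent) - 1))
        total_chars = sum(char_counts.values())
        total_pairs = sum(pair_counts.values())
        new_words = []
        for pair, c in pair_counts.items():
            if c < min_count:
                continue
            p_pair = c / total_pairs
            p_a = char_counts[pair[0]] / total_chars
            p_b = char_counts[pair[1]] / total_chars
            if math.log(p_pair / (p_a * p_b)) >= min_pmi:
                new_words.append(pair)
        return new_words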
Step 120: traverse the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features.
In the embodiments of the present disclosure, the features of the bag-of-words features may be the words in the vocabulary. Traversing the sample sentences based on the vocabulary may mean traversing each sample sentence in vocabulary order: if the word at a given vocabulary position appears in the sample, the number of times that word appears across all samples is counted and that position is set to this count; if the word does not appear in the sample, that position is set to 0. In this way a word frequency vector is generated for each sample sentence, whose dimension equals the number of words in the vocabulary.
For example, in the embodiments of the present disclosure, each sample sentence in Table 1 is traversed based on the vocabulary, and the resulting word frequency vectors are shown in Table 2; in this example, the features in the generated bag-of-words features are the words in the vocabulary.
TABLE 2 Word frequency vectors of the sample sentences
Sample sentence | Word frequency vector
Chinese television actor | [0,0,0,0,0,0,0,0,2,0,0,0,0]
The American actor Li Xiaosi attends the awards ceremony | [0,2,0,2,6,2,2,0,0,2,0,0,1]
The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu | [0,2,0,2,6,2,2,0,0,0,0,1,0]
The Chinese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,2,0,0,0,0]
The American sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,2,0,0,0]
The German sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,0,1,0,0]
The Japanese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,0,0,0,0]
In some cases, the word frequency vector of a sample sentence may instead be obtained by traversing each sample sentence in vocabulary order and setting a position to 1 if the word at that vocabulary position appears in the sample, and to 0 otherwise. This likewise generates a word frequency vector for each sample sentence whose dimension equals the number of words in the vocabulary.
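Step 120 can be sketched as follows; following the description above, a vector position holds the word's total number of occurrences over all samples when the word occurs in the given sentence and 0 otherwise, and binary=True selects the simpler presence/absence variant. The function name is an assumption made for this sketch.

    def bag_of_words(sentences, vocabulary, binary=False):
        # Total occurrences of each vocabulary word over all sample sentences.
        totals = {w: sum(sent.count(w) for sent in sentences) for w in vocabulary}
        vectors = []
        for sent in sentences:
            row = []
            for w in vocabulary:
                if w in sent:
                    row.append(1 if binary else totals[w])
                else:
                    row.append(0)
            vectors.append(row)
        return vectors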
Step 130: input the bag-of-words features and the domain labels into a ranking model, rank the importance of the features in the bag-of-words features with the ranking model, and output the features whose importance is greater than a threshold as important phrases of the domain.
In the embodiments of the present disclosure, the ranking model may be a GBDT (gradient boosting decision tree) model, or any other model capable of ranking the importance of the features in the bag-of-words features; the present disclosure does not limit this.
When the ranking model is a GBDT model, the domain labels of the sample sentences and the bag-of-words features obtained in step 120 are input into the GBDT model, which ranks the importance of the features in the bag-of-words features and outputs the features whose importance is greater than a certain threshold as important phrases of the domain.
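When the ranking model is a GBDT model, step 130 can be sketched with scikit-learn's gradient boosting classifier as follows; the library, the hyper-parameters, and the importance threshold of 0.01 are assumptions made for illustration, since the disclosure does not prescribe a particular implementation.

    from sklearn.ensemble import GradientBoostingClassifier

    def important_phrases(vectors, labels, vocabulary, threshold=0.01):
        # Fit GBDT on the bag-of-words vectors and domain labels, then keep the
        # vocabulary words whose feature importance is greater than the threshold.
        model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
        model.fit(vectors, labels)
        return [word for word, score in zip(vocabulary, model.feature_importances_)
                if score > threshold]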
Step 140: extend the important phrases through sound and shape variation to generate an extended phrase set.
In the embodiments of the disclosure, extending the important phrases through sound variation may mean replacing characters with homophones; it may also mean replacing characters with near-homophones, for example by substituting the easily confused finals {("ing", "in"), ("eng", "en"), ("ang", "an")}.
In the embodiments of the disclosure, extending an important phrase through shape variation may mean consulting a four-corner code table and choosing, as a substitute, a Chinese character with the same code as a character in the important phrase. For example, if an important phrase contains the character for "peak", whose four-corner code is 27754, another Chinese character sharing the code 27754 can be selected at random as a substitute, yielding an expanded variant of the phrase.
It should be noted that the sound and shape variation methods given above are only examples; those skilled in the art may choose other variation methods capable of expanding phrases as needed, and the disclosure does not limit this.
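The sound-variation part of step 140 can be sketched as follows; the pypinyin library is assumed purely for illustration (the disclosure does not name one), and shape variation via a four-corner code table would substitute characters analogously using such a table as the lookup.

    from collections import defaultdict
    from pypinyin import lazy_pinyin

    NEAR_FINALS = {"ing": "in", "in": "ing", "eng": "en", "en": "eng", "ang": "an", "an": "ang"}

    def build_homophone_index(candidate_chars):
        # Group candidate characters (e.g. all characters seen in the corpus)
        # by their toneless pinyin.
        index = defaultdict(set)
        for ch in candidate_chars:
            index[lazy_pinyin(ch)[0]].add(ch)
        return index

    def near_pinyins(py):
        # The pinyin itself plus hard-to-distinguish final variants (ing/in, eng/en, ang/an).
        variants = {py}
        for a, b in NEAR_FINALS.items():
            if py.endswith(a):
                variants.add(py[: -len(a)] + b)
        return variants

    def expand_by_sound(phrase, homophone_index, max_variants=20):
        # Replace one character at a time with a homophone or near-homophone.
        variants = set()
        for i, ch in enumerate(phrase):
            for py in near_pinyins(lazy_pinyin(ch)[0]):
                for sub in homophone_index.get(py, ()):
                    if sub != ch:
                        variants.add(phrase[:i] + sub + phrase[i + 1:])
        return list(variants)[:max_variants]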
In some cases, some existing domain phrases may already be available in advance. The important phrases obtained in step 130 can then be merged with the existing domain phrases to obtain an initial phrase set, and the initial phrase set is extended through sound and shape variation to generate the extended phrase set. In this way the existing domain phrases are fully used to supplement the important phrases obtained in step 130, effectively increasing the number of mined domain phrases.
Given the diversity of web language, extending the important phrases or domain phrases through sound and shape variation further increases the number of mined domain phrases and improves mining efficiency. Moreover, because such variant words are usually highly accurate, the phrases obtained by this extension are also highly accurate; applying the extended phrases to identify sentences on the web effectively improves the identification accuracy and recall rate for phrases related to the domain.
Step 150: search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
In the embodiments of the disclosure, searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set may include: performing word segmentation and character segmentation on the sample sentences and generating corresponding word vectors and character vectors; vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
In the embodiments of the disclosure, after the sample sentences are segmented into words and characters, the corresponding word vectors and character vectors can be generated with the word2vec method.
It should be noted that vectorizing the words and characters of the sample sentences with word2vec is only an example provided in the embodiments of the present disclosure; those skilled in the art may choose other vectorization methods as needed, which the present disclosure does not limit.
In the embodiments of the present disclosure, the phrases in the extended phrase set may be vectorized to obtain the vector of any phrase in the following ways:
In some cases, if a phrase in the extended phrase set is contained in the word segmentation result of the sample sentences, the corresponding word vector can be used directly as the vector of the phrase. For example, if the phrase "actor" in the extended phrase set appears in the word segmentation result, the word vector of the segment "actor" can be used directly as the vector of the phrase "actor".
In some cases, if a phrase in the extended phrase set is not contained in the word segmentation result but is a combination of segmented words, or of segmented words and characters, from the sample sentences, the vector of the phrase can be computed from the vectors of those words or characters, for example by summing the corresponding vectors dimension by dimension and taking the average.
In some cases, if a phrase in the extended phrase set is not contained in the word segmentation result and some characters of the phrase do not appear in the sample sentences at all, those characters are represented by other means; for example, they are mapped to a special symbol whose vector is looked up in a built-in vector table, that vector is summed dimension by dimension with the vectors of the remaining characters of the phrase, and the average is taken as the vector of the phrase.
In the embodiments of the present disclosure, the similarity between the vector of any phrase and the word vectors generated from the sample sentences is calculated, and the segmented words whose similarity is greater than the preset value are selected as the domain phrases adjacent to that phrase. The adjacent domain phrases obtained in this way are then added to the extended phrase set to obtain the finally mined domain phrase set.
It should be noted that searching for adjacent domain phrases by computing similarity with the cosine distance is only an example; those skilled in the art may choose other methods of searching the sample sentences for domain phrases adjacent to the phrases in the extended phrase set, and the present disclosure is not limited thereto.
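Step 150 can be sketched as follows; gensim's word2vec and cosine similarity are assumptions made for illustration, the disclosure only requiring some vectorization of the segments and phrases plus a similarity threshold, and averaging per-character vectors here stands in for the special-symbol lookup described above.

    import numpy as np
    from gensim.models import Word2Vec

    def find_neighbor_phrases(tokenized_sentences, phrases, threshold=0.8):
        # tokenized_sentences is a list of token lists (segmented sample sentences).
        model = Word2Vec(tokenized_sentences, vector_size=100, min_count=1, window=5)

        def phrase_vector(phrase):
            # Use the phrase's own vector if it was a segment; otherwise average
            # the vectors of its in-vocabulary characters.
            tokens = [phrase] if phrase in model.wv else list(phrase)
            vecs = [model.wv[t] for t in tokens if t in model.wv]
            return np.mean(vecs, axis=0) if vecs else None

        neighbors = set()
        for phrase in phrases:
            pv = phrase_vector(phrase)
            if pv is None:
                continue
            for token in model.wv.index_to_key:
                tv = model.wv[token]
                sim = float(np.dot(pv, tv) / (np.linalg.norm(pv) * np.linalg.norm(tv)))
                if sim > threshold and token not in phrases:
                    neighbors.add(token)
        return neighbors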
The above provides a method of mining domain phrases from sample sentences. Building on the obtained domain phrase set, the embodiments of the present disclosure further provide a method of applying the domain phrase set to identify the domain of a new sentence.
For example, suppose the obtained entertainment-domain phrase set is { "actor", "ceremony", "attend", "awards" } and that ("actor", "ceremony") is a phrase pair in this set. The pair appears 2 times in the entertainment-domain sample sentences of Table 1, and there are 3 entertainment-domain samples in total, so the frequency of the pair in the entertainment-domain sample sentences is 0.667. If the preset frequency value is set to 0.6, the phrase pair ("actor", "ceremony"), whose frequency exceeds the preset value, can be selected for domain identification of new sentences: if a new sentence contains the selected pair ("actor", "ceremony"), the new sentence is labeled as belonging to the entertainment domain. Using such high-frequency domain phrase pairs from the sample sentences to identify the domain of new sentences effectively improves identification accuracy and efficiency.
The preset value chosen in the embodiments of the present disclosure for the frequency with which a phrase pair of the domain phrase set appears in the domain's sample sentences is merely an example; those skilled in the art may choose other suitable preset values as needed, and the present disclosure does not limit this.
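The phrase-pair selection and domain labeling described above can be sketched as follows; the 0.6 threshold follows the example, and the helper names are illustrative only.

    from itertools import combinations

    def frequent_phrase_pairs(domain_sentences, domain_phrases, min_freq=0.6):
        # Keep phrase pairs that co-occur in a large enough fraction of the domain's samples.
        pairs = {}
        for a, b in combinations(sorted(domain_phrases), 2):
            hits = sum(1 for sent in domain_sentences if a in sent and b in sent)
            freq = hits / len(domain_sentences)
            if freq > min_freq:
                pairs[(a, b)] = freq
        return pairs

    def belongs_to_domain(sentence, selected_pairs):
        # Label a new sentence as in-domain when any selected phrase pair occurs in it.
        return any(a in sentence and b in sentence for (a, b) in selected_pairs)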
A domain phrase mining apparatus provided in an embodiment of the present disclosure is described below. FIG. 2 shows a schematic diagram of a domain phrase mining device 200 provided according to an embodiment of the present disclosure. The device specifically includes:
a vocabulary construction unit 201, configured to extract N-gram features from sample sentences carrying domain labels and to select the N-gram features whose frequency is greater than a preset value as a vocabulary;
a bag-of-words feature generation unit 202, configured to traverse the sample sentences based on the vocabulary and to generate bag-of-words features comprising the features and word frequency vectors of the features;
a ranking unit 203, configured to receive the bag-of-words features and the domain labels, rank the importance of the features in the bag-of-words features, and output the features whose importance is greater than a threshold as important phrases of the domain;
an extension unit 204, configured to extend the important phrases through sound and shape variation to generate an extended phrase set; and
a neighbor search unit 205, configured to search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and to add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
The domain phrase mining method and device provided by the embodiments of the present disclosure effectively extend the domain phrases based on sample sentences carrying domain labels; the mined domain phrases are highly accurate and suitable for domain identification of complex and varied web language.
Fig. 3 shows a schematic structural diagram of an electronic device 300 provided according to an embodiment of the present disclosure. As shown in fig. 3, the electronic apparatus 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary, so that a computer program read out therefrom is mounted into the storage section 308 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer-readable medium bearing instructions; in such embodiments, the instructions may be downloaded and installed from a network via the communication section 309 and/or installed from the removable medium 311. When executed by the central processing unit (CPU) 301, the instructions perform the method steps described in the present disclosure.
The above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the scope of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present disclosure and shall be construed as falling within it.

Claims (8)

1. A domain phrase mining method, comprising:
extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary;
traversing the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features;
inputting the bag-of-words features and the domain labels into a ranking model, ranking the importance of the features in the bag-of-words features with the ranking model, and outputting the features whose importance is greater than a threshold as important phrases of the domain;
extending the important phrases through sound and shape variation to generate an extended phrase set;
searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain a domain phrase set; and
calculating the frequency with which any phrase pair in the domain phrase set appears in the sample sentences of the domain, selecting the phrase pairs whose frequency exceeds a preset value, and using the selected phrase pairs to determine whether new corpus text belongs to the domain.
2. The method of claim 1, further comprising:
performing word segmentation on the sample sentences, and obtaining new words based on the segmentation; and
combining the new words with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
3. The method of claim 2, wherein obtaining new words based on the segmentation comprises obtaining the new words through an unsupervised method.
4. The method of any of claims 1-3, wherein the feature length N of the N-gram features is 2 to 4.
5. The method of claim 1, further comprising:
merging the important phrases with existing domain phrases to obtain an initial phrase set, and extending the initial phrase set through sound and shape variation to generate the extended phrase set.
6. The method of claim 1, wherein searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set comprises:
performing word segmentation and character segmentation on the sample sentences, and generating corresponding word vectors and character vectors;
vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and
calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
8. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN202010957899.1A 2020-09-14 2020-09-14 Domain phrase mining method and device Active CN111814474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010957899.1A CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010957899.1A CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Publications (2)

Publication Number Publication Date
CN111814474A CN111814474A (en) 2020-10-23
CN111814474B true CN111814474B (en) 2021-01-29

Family

ID=72860712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010957899.1A Active CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Country Status (1)

Country Link
CN (1) CN111814474B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818686B (en) * 2021-03-23 2023-10-31 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN115168895B (en) * 2022-07-08 2023-12-12 深圳市芒果松科技有限公司 User information threat analysis method and server combined with artificial intelligence
CN117034917B (en) * 2023-10-08 2023-12-22 中国医学科学院医学信息研究所 English text word segmentation method, device and computer readable medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328380A1 (en) * 2014-02-22 2016-11-10 Tencent Technology (Shenzhen) Company Limited Method and apparatus for determining morpheme importance analysis model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214238B (en) * 2011-07-01 2012-10-24 临沂大学 Device and method for matching similarity of Chinese words
CN107423398B (en) * 2017-07-26 2023-04-18 腾讯科技(上海)有限公司 Interaction method, interaction device, storage medium and computer equipment
CN109408802A (en) * 2018-08-28 2019-03-01 厦门快商通信息技术有限公司 A kind of method, system and storage medium promoting sentence vector semanteme
CN109325015B (en) * 2018-08-31 2021-07-20 创新先进技术有限公司 Method and device for extracting characteristic field of domain model
CN110674252A (en) * 2019-08-26 2020-01-10 银江股份有限公司 High-precision semantic search system for judicial domain
CN110704391A (en) * 2019-09-23 2020-01-17 车智互联(北京)科技有限公司 Word stock construction method and computing device
CN110688836A (en) * 2019-09-30 2020-01-14 湖南大学 Automatic domain dictionary construction method based on supervised learning


Also Published As

Publication number Publication date
CN111814474A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111814474B (en) Domain phrase mining method and device
CN107832414B (en) Method and device for pushing information
CN107633007B (en) Commodity comment data tagging system and method based on hierarchical AP clustering
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
JP2011227688A (en) Method and device for extracting relation between two entities in text corpus
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
Lyu et al. Joint word segmentation, pos-tagging and syntactic chunking
Jihan et al. Multi-domain aspect extraction using support vector machines
CN108763192B (en) Entity relation extraction method and device for text processing
Sousa et al. Word sense disambiguation: an evaluation study of semi-supervised approaches with word embeddings
Wang et al. Interactive Topic Model with Enhanced Interpretability.
CN111506726B (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Ghosh Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms
CN112732863B (en) Standardized segmentation method for electronic medical records
CN111125329B (en) Text information screening method, device and equipment
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
D’hondt et al. Topic identification based on document coherence and spectral analysis
CN116933782A (en) E-commerce text keyword extraction processing method and system
Thuy et al. Leveraging foreign language labeled data for aspect-based opinion mining
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
Gupta et al. Domain adaptation of information extraction models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant