CN111814474B - Domain phrase mining method and device

Domain phrase mining method and device

Info

Publication number
CN111814474B
CN111814474B
Authority
CN
China
Prior art keywords
phrase
phrases
domain
word
field
Prior art date
Legal status
Active
Application number
CN202010957899.1A
Other languages
Chinese (zh)
Other versions
CN111814474A (en)
Inventor
辛秉哲
周源
Current Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Original Assignee
Zhizhe Sihai Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhizhe Sihai Beijing Technology Co Ltd filed Critical Zhizhe Sihai Beijing Technology Co Ltd
Priority to CN202010957899.1A priority Critical patent/CN111814474B/en
Publication of CN111814474A publication Critical patent/CN111814474A/en
Application granted granted Critical
Publication of CN111814474B publication Critical patent/CN111814474B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00  Handling natural language data
    • G06F 40/20  Natural language analysis
    • G06F 40/279  Recognition of textual entities
    • G06F 40/289  Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a domain phrase mining method and device. The method includes: extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary; traversing the sample sentences based on the vocabulary to generate bag-of-words features; inputting the bag-of-words features and the domain labels into a ranking model, which ranks the importance of the features in the bag-of-words features and outputs the features whose importance is greater than a threshold as important phrases of the domain; extending the important phrases through sound and shape variation to generate an extended phrase set; and searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain the domain phrase set. The domain phrase mining method and device provided by the disclosure effectively address the low mining efficiency, small number of mined phrases, and low accuracy of prior-art domain phrase mining methods.

Description

Domain phrase mining method and device
Technical Field
The disclosure relates to the field of computer and Internet technology, and in particular to a domain phrase mining method and device, an electronic device, and a computer-readable medium.
Background
In natural language processing services, content must be assigned to domains and content in certain domains (for example, the political domain) must be recalled to ensure content security. Because of the diversity of web language, domain phrases need to be mined as accurately as possible and then applied to the domain identification of content, so as to improve the recall rate.
Prior-art domain phrase mining methods include unsupervised and supervised approaches. Phrases mined by existing unsupervised methods are not necessarily domain phrases and require further screening, so mining efficiency is low; existing supervised methods mine only a small number of domain phrases with low accuracy. It is therefore necessary to provide a domain phrase mining method that is efficient, mines a large number of phrases, and is accurate.
Disclosure of Invention
In view of this, the present disclosure provides a domain phrase mining method and device, which can effectively solve the problems of low mining efficiency, small number of mined phrases, and low accuracy in prior-art domain phrase mining methods.
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
According to a first aspect of the present disclosure, a domain phrase mining method is provided, including:
extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary;
traversing the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features;
inputting the bag-of-words features and the domain labels into a ranking model, ranking the importance of the features in the bag-of-words features with the ranking model, and outputting the features whose importance is greater than a threshold as important phrases of the domain;
extending the important phrases through sound and shape variation to generate an extended phrase set; and
searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
In some embodiments, the domain phrase mining method further comprises:
performing word segmentation on the sample sentences, and obtaining new words based on the segmentation; and
combining the new words with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
Further, obtaining new words based on the segmentation includes obtaining the new words through an unsupervised method.
In some embodiments, the feature length N of the N-gram features is 2 to 4.
In some embodiments, the domain phrase mining method further comprises:
merging the important phrases with existing domain phrases to obtain an initial phrase set, and extending the initial phrase set through sound and shape variation to generate the extended phrase set.
In some embodiments, searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set specifically includes:
performing word segmentation and character segmentation on the sample sentences, and generating corresponding word vectors and character vectors;
vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and
calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
In some embodiments, the domain phrase mining method further comprises:
calculating the frequency with which any phrase pair in the domain phrase set appears in the sample sentences of the domain, selecting the phrase pairs whose frequency exceeds a preset value, and using the selected phrase pairs to determine whether new corpus text belongs to the domain.
According to a second aspect of the present disclosure, there is provided a domain phrase mining apparatus, including:
a vocabulary construction unit, configured to extract N-gram features from sample sentences carrying domain labels and to select the N-gram features whose frequency is greater than a preset value as a vocabulary;
a bag-of-words feature generation unit, configured to traverse the sample sentences based on the vocabulary and to generate bag-of-words features comprising the features and word frequency vectors of the features;
a ranking unit, configured to receive the bag-of-words features and the domain labels, rank the importance of the features in the bag-of-words features, and output the features whose importance is greater than a threshold as important phrases of the domain;
an extension unit, configured to extend the important phrases through sound and shape variation to generate an extended phrase set; and
a neighbor search unit, configured to search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and to add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
According to a third aspect of the present disclosure, there is provided an electronic device comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method as provided by the first aspect of the disclosure.
According to a fourth aspect of the present disclosure, there is provided a computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method as provided by the first aspect of the present disclosure.
The present disclosure mines domain phrases with N-gram feature extraction and a ranking model, and then extends the mined phrases and searches for their neighbors, thereby effectively solving the low mining efficiency, small number of mined phrases, and low accuracy of existing domain phrase mining.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort. The foregoing and other objects, features, and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not drawn to scale; emphasis is instead placed on illustrating the subject matter of the present application.
FIG. 1 is a flowchart of a domain phrase mining method provided according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a domain phrase mining device according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings.
It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Relational terms such as "first" and "second" are used herein only to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions. The terms "comprises," "comprising," and any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone.
Exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another.
Here, it should be further noted that, in order to avoid obscuring the present disclosure with unnecessary details, only the device structure closely related to the scheme according to the present disclosure is shown in the drawings, and other details not so related to the present disclosure are omitted.
It is to be understood that the disclosure, described below with reference to the drawings, is not limited to the described embodiments. Where feasible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted from an embodiment.
Prior-art phrase mining methods fall mainly into two categories. One is unsupervised mining, which discovers new words with measures such as mutual information and degree of freedom, and expands seed keywords by clustering, near-synonym expansion, keyword variation, and similar techniques to obtain new words. The other is supervised mining, which scores phrase weights with statistics such as f-ngram-idf or TextRank, selects the phrases with larger weights as a candidate set of newly discovered phrases, and then discovers new phrases with a classification or sequence labeling model.
To address these problems, the present disclosure mines domain phrases with N-gram feature extraction and a feature-importance ranking model, and extends the mined phrases and searches for their neighbors, thereby effectively improving the efficiency of domain phrase mining while ensuring the number and accuracy of the mined phrases.
A domain phrase mining method provided in the embodiments of the present disclosure is first described in detail below.
FIG. 1 illustrates a flow diagram of a domain phrase mining method 100 provided in accordance with an embodiment of the present disclosure. The method specifically comprises the following steps:
step 110: and extracting N-gram characteristics of the sample sentences with the field labels, and selecting the N-gram characteristics with the frequency greater than a preset value as a word list.
Here, a domain label indicates whether a sample sentence belongs to a certain domain. For example, in the embodiments of the present disclosure the domain label may be a binary label: when the sample sentence belongs to the domain, the label is set to "1"; when it does not, the label is set to "0". It should be noted that the binary label is only an example; the present disclosure does not limit the form of the domain label.
N-gram is an algorithm based on a statistical language model: a window of size N is slid over the characters of a text, yielding N-gram features of length N. Extracting N-gram features from a sample sentence means feeding the sentence into the N-gram model to form features of length N. For example, when 2-gram features are extracted from the sentence "Xiao Ming takes the bus to school", the resulting 2-grams are the overlapping two-character windows of the sentence, such as "Xiao Ming", "Ming takes", "takes the", ..., "to school".
In the embodiments of the present disclosure, sample sentences carrying domain labels may be provided as shown in Table 1. N-gram features are extracted from the sample sentences in Table 1, and the N-gram features whose frequency is greater than a preset value are then selected as the vocabulary.
TABLE 1 Sample sentences with domain labels
Sample sentence | Domain label (1 = entertainment, 0 = non-entertainment)
Chinese television actor | 1
The American actor Li Xiaosi attends the awards ceremony | 1
The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu | 1
The Chinese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The American sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The German sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
The Japanese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | 0
In some cases, the feature length N of the N-gram features can be chosen as 2 to 4. N-gram features of length 2 to 4 are extracted from every sample sentence in Table 1, all extracted N-gram features are pooled, and the frequency of each N-gram feature, i.e. its number of occurrences, is counted. The N-gram features whose frequency is greater than a preset value are selected as the vocabulary. In the embodiments of the present disclosure the preset value may be 5, in which case the vocabulary formed by the N-gram features with frequency greater than 5 is { "sports", "actor", "opening", "ceremony", "attend", "awards", "Olympic Games", "China", "United States", "Germany" }.
It should be noted that the sample sentences and domain labels provided in the embodiments of the disclosure are only examples; those skilled in the art may choose other sample sentences and domain labels as needed, and the disclosure places no limit on this. Those skilled in the art may also extract N-gram features of other lengths from the sample sentences; the disclosure does not limit the length of the N-gram features.
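By way of illustration only, step 110 can be sketched in Python roughly as follows; the helper names and the choices n_range=(2, 3, 4) and min_freq=5 mirror the example above and are assumptions made for this sketch, not limitations of the disclosure. The returned list plays the role of the vocabulary in the following steps, and the new words obtained below can simply be appended to it.

    from collections import Counter

    def char_ngrams(sentence, n):
        # Slide a window of size n over the characters of the sentence.
        return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]

    def build_vocabulary(sentences, n_range=(2, 3, 4), min_freq=5):
        # Count every N-gram over the whole sample corpus and keep those whose
        # frequency (number of occurrences) is greater than the preset value.
        counts = Counter()
        for sent in sentences:
            for n in n_range:
                counts.update(char_ngrams(sent, n))
        return [gram for gram, freq in counts.items() if freq > min_freq]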
In some cases, the sample sentences can be segmented into words, new words can be obtained from the segmentation through an unsupervised method, and the new words can be combined with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary. In the embodiments of the present disclosure, the unsupervised method may be mutual information calculation or clustering, which the present disclosure does not specifically limit.
For example, the sample sentence "The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu" is segmented, and the resulting segmentation includes "Wang Xiaowu". Suppose the unsupervised method applied to the segmentation yields the two new words "Wang Xiaowu" and "Li Xiaosi". These two new words can then be combined with the selected N-gram features whose frequency is greater than 5 to form the vocabulary { "sports", "actor", "opening", "ceremony", "attend", "awards", "Olympic Games", "China", "United States", "Germany", "Wang Xiaowu", "Li Xiaosi" }.
The above is only an example. In the embodiments of the present disclosure, all sample sentences are segmented, the segmentation results are merged and de-duplicated to obtain a segment set, new words are obtained from each segment in the set through the unsupervised method, and all the obtained new words are combined with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
In the embodiments of the disclosure, the new words obtained from the segmentation of the sample sentences effectively supplement the vocabulary formed by the N-gram features, and avoid missing low-frequency domain features that would be dropped if only the N-gram features with frequency greater than the preset value were selected.
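A minimal sketch of the mutual-information variant of the unsupervised new-word discovery mentioned above is given below; restricting candidates to two-character strings and the thresholds min_count and min_pmi are illustrative assumptions only, not requirements of the disclosure.

    import math
    from collections import Counter

    def pmi_new_words(sentences, min_count=5, min_pmi=3.0):
        # Return two-character candidates whose pointwise mutual information is high,
        # i.e. whose characters co-occur far more often than chance would predict.
        char_counts = Counter()
        pair_counts = Counter()
        for sent in sentences:
            char_counts.update(sent)
            pair_counts.update(sent[i:i + 2] for i in range(len(sent) - 1))
        total_chars = sum(char_counts.values())
        total_pairs = sum(pair_counts.values())
        new_words = []
        for pair, c in pair_counts.items():
            if c < min_count:
                continue
            p_pair = c / total_pairs
            p_a = char_counts[pair[0]] / total_chars
            p_b = char_counts[pair[1]] / total_chars
            if math.log(p_pair / (p_a * p_b)) >= min_pmi:
                new_words.append(pair)
        return new_words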
Step 120: traverse the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features.
In the embodiments of the present disclosure, the features of the bag-of-words features may be the words in the vocabulary. Traversing the sample sentences based on the vocabulary may mean traversing each sample sentence in vocabulary order: if the word at a given vocabulary position appears in the sample, the number of times that word appears across all samples is counted and that position is set to this count; if the word does not appear in the sample, that position is set to 0. In this way a word frequency vector is generated for each sample sentence, whose dimension equals the number of words in the vocabulary.
For example, in the embodiments of the present disclosure, each sample sentence in Table 1 is traversed based on the vocabulary, and the resulting word frequency vectors are shown in Table 2; in this example, the features in the generated bag-of-words features are the words in the vocabulary.
TABLE 2 Word frequency vectors of the sample sentences
Sample sentence | Word frequency vector
Chinese television actor | [0,0,0,0,0,0,0,0,2,0,0,0,0]
The American actor Li Xiaosi attends the awards ceremony | [0,2,0,2,6,2,2,0,0,2,0,0,1]
The French actor Wang Xiaowu attends the awards ceremony with the Japanese singer Zhang Xiaoliu | [0,2,0,2,6,2,2,0,0,0,0,1,0]
The Chinese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,2,0,0,0,0]
The American sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,2,0,0,0]
The German sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,0,1,0,0]
The Japanese sports delegation attends the opening ceremony of the 31st Summer Olympic Games in Rio de Janeiro | [4,0,4,0,6,0,0,4,0,0,0,0,0]
In some cases, the word frequency vector of a sample sentence may instead be obtained by traversing each sample sentence in vocabulary order and setting a position to 1 if the word at that vocabulary position appears in the sample, and to 0 otherwise. This likewise generates a word frequency vector for each sample sentence whose dimension equals the number of words in the vocabulary.
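Step 120 can be sketched as follows; following the description above, a vector position holds the word's total number of occurrences over all samples when the word occurs in the given sentence and 0 otherwise, and binary=True selects the simpler presence/absence variant. The function name is an assumption made for this sketch.

    def bag_of_words(sentences, vocabulary, binary=False):
        # Total occurrences of each vocabulary word over all sample sentences.
        totals = {w: sum(sent.count(w) for sent in sentences) for w in vocabulary}
        vectors = []
        for sent in sentences:
            row = []
            for w in vocabulary:
                if w in sent:
                    row.append(1 if binary else totals[w])
                else:
                    row.append(0)
            vectors.append(row)
        return vectors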
Step 130: input the bag-of-words features and the domain labels into a ranking model, rank the importance of the features in the bag-of-words features with the ranking model, and output the features whose importance is greater than a threshold as important phrases of the domain.
In the embodiments of the present disclosure, the ranking model may be a GBDT (gradient boosting decision tree) model, or any other model capable of ranking the importance of the features in the bag-of-words features; the present disclosure does not limit this.
When the ranking model is a GBDT model, the domain labels of the sample sentences and the bag-of-words features obtained in step 120 are input into the GBDT model, which ranks the importance of the features in the bag-of-words features and outputs the features whose importance is greater than a certain threshold as important phrases of the domain.
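When the ranking model is a GBDT model, step 130 can be sketched with scikit-learn's gradient boosting classifier as follows; the library, the hyper-parameters, and the importance threshold of 0.01 are assumptions made for illustration, since the disclosure does not prescribe a particular implementation.

    from sklearn.ensemble import GradientBoostingClassifier

    def important_phrases(vectors, labels, vocabulary, threshold=0.01):
        # Fit GBDT on the bag-of-words vectors and domain labels, then keep the
        # vocabulary words whose feature importance is greater than the threshold.
        model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
        model.fit(vectors, labels)
        return [word for word, score in zip(vocabulary, model.feature_importances_)
                if score > threshold]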
Step 140: extend the important phrases through sound and shape variation to generate an extended phrase set.
In the embodiments of the disclosure, extending the important phrases through sound variation may mean replacing characters with homophones; it may also mean replacing characters with near-homophones, for example by substituting the easily confused finals {("ing", "in"), ("eng", "en"), ("ang", "an")}.
In the embodiments of the disclosure, extending an important phrase through shape variation may mean consulting a four-corner code table and choosing, as a substitute, a Chinese character with the same code as a character in the important phrase. For example, if an important phrase contains the character for "peak", whose four-corner code is 27754, another Chinese character sharing the code 27754 can be selected at random as a substitute, yielding an expanded variant of the phrase.
It should be noted that the sound and shape variation methods given above are only examples; those skilled in the art may choose other variation methods capable of expanding phrases as needed, and the disclosure does not limit this.
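The sound-variation part of step 140 can be sketched as follows; the pypinyin library is assumed purely for illustration (the disclosure does not name one), and shape variation via a four-corner code table would substitute characters analogously using such a table as the lookup.

    from collections import defaultdict
    from pypinyin import lazy_pinyin

    NEAR_FINALS = {"ing": "in", "in": "ing", "eng": "en", "en": "eng", "ang": "an", "an": "ang"}

    def build_homophone_index(candidate_chars):
        # Group candidate characters (e.g. all characters seen in the corpus)
        # by their toneless pinyin.
        index = defaultdict(set)
        for ch in candidate_chars:
            index[lazy_pinyin(ch)[0]].add(ch)
        return index

    def near_pinyins(py):
        # The pinyin itself plus hard-to-distinguish final variants (ing/in, eng/en, ang/an).
        variants = {py}
        for a, b in NEAR_FINALS.items():
            if py.endswith(a):
                variants.add(py[: -len(a)] + b)
        return variants

    def expand_by_sound(phrase, homophone_index, max_variants=20):
        # Replace one character at a time with a homophone or near-homophone.
        variants = set()
        for i, ch in enumerate(phrase):
            for py in near_pinyins(lazy_pinyin(ch)[0]):
                for sub in homophone_index.get(py, ()):
                    if sub != ch:
                        variants.add(phrase[:i] + sub + phrase[i + 1:])
        return list(variants)[:max_variants]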
In some cases, some existing domain phrases may already be available in advance. The important phrases obtained in step 130 can then be merged with the existing domain phrases to obtain an initial phrase set, and the initial phrase set is extended through sound and shape variation to generate the extended phrase set. In this way the existing domain phrases are fully used to supplement the important phrases obtained in step 130, effectively increasing the number of mined domain phrases.
Given the diversity of web language, extending the important phrases or domain phrases through sound and shape variation further increases the number of mined domain phrases and improves mining efficiency. Moreover, because such variant words are usually highly accurate, the phrases obtained by this extension are also highly accurate; applying the extended phrases to identify sentences on the web effectively improves the identification accuracy and recall rate for phrases related to the domain.
Step 150: search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
In the embodiments of the disclosure, searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set may include: performing word segmentation and character segmentation on the sample sentences and generating corresponding word vectors and character vectors; vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
In the embodiments of the disclosure, after the sample sentences are segmented into words and characters, the corresponding word vectors and character vectors can be generated with the word2vec method.
It should be noted that vectorizing the words and characters of the sample sentences with word2vec is only an example provided in the embodiments of the present disclosure; those skilled in the art may choose other vectorization methods as needed, which the present disclosure does not limit.
In the embodiments of the present disclosure, the phrases in the extended phrase set may be vectorized to obtain the vector of any phrase in the following ways:
In some cases, if a phrase in the extended phrase set is contained in the word segmentation result of the sample sentences, the corresponding word vector can be used directly as the vector of the phrase. For example, if the phrase "actor" in the extended phrase set appears in the word segmentation result, the word vector of the segment "actor" can be used directly as the vector of the phrase "actor".
In some cases, if a phrase in the extended phrase set is not contained in the word segmentation result but is a combination of segmented words, or of segmented words and characters, from the sample sentences, the vector of the phrase can be computed from the vectors of those words or characters, for example by summing the corresponding vectors dimension by dimension and taking the average.
In some cases, if a phrase in the extended phrase set is not contained in the word segmentation result and some characters of the phrase do not appear in the sample sentences at all, those characters are represented by other means; for example, they are mapped to a special symbol whose vector is looked up in a built-in vector table, that vector is summed dimension by dimension with the vectors of the remaining characters of the phrase, and the average is taken as the vector of the phrase.
In the embodiments of the present disclosure, the similarity between the vector of any phrase and the word vectors generated from the sample sentences is calculated, and the segmented words whose similarity is greater than the preset value are selected as the domain phrases adjacent to that phrase. The adjacent domain phrases obtained in this way are then added to the extended phrase set to obtain the finally mined domain phrase set.
It should be noted that searching for adjacent domain phrases by computing similarity with the cosine distance is only an example; those skilled in the art may choose other methods of searching the sample sentences for domain phrases adjacent to the phrases in the extended phrase set, and the present disclosure is not limited thereto.
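Step 150 can be sketched as follows; gensim's word2vec and cosine similarity are assumptions made for illustration, the disclosure only requiring some vectorization of the segments and phrases plus a similarity threshold, and averaging per-character vectors here stands in for the special-symbol lookup described above.

    import numpy as np
    from gensim.models import Word2Vec

    def find_neighbor_phrases(tokenized_sentences, phrases, threshold=0.8):
        # tokenized_sentences is a list of token lists (segmented sample sentences).
        model = Word2Vec(tokenized_sentences, vector_size=100, min_count=1, window=5)

        def phrase_vector(phrase):
            # Use the phrase's own vector if it was a segment; otherwise average
            # the vectors of its in-vocabulary characters.
            tokens = [phrase] if phrase in model.wv else list(phrase)
            vecs = [model.wv[t] for t in tokens if t in model.wv]
            return np.mean(vecs, axis=0) if vecs else None

        neighbors = set()
        for phrase in phrases:
            pv = phrase_vector(phrase)
            if pv is None:
                continue
            for token in model.wv.index_to_key:
                tv = model.wv[token]
                sim = float(np.dot(pv, tv) / (np.linalg.norm(pv) * np.linalg.norm(tv)))
                if sim > threshold and token not in phrases:
                    neighbors.add(token)
        return neighbors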
The above provides a method of mining domain phrases from sample sentences. Building on the obtained domain phrase set, the embodiments of the present disclosure further provide a method of applying the domain phrase set to identify the domain of a new sentence.
For example, suppose the obtained entertainment-domain phrase set is { "actor", "ceremony", "attend", "awards" } and that ("actor", "ceremony") is a phrase pair in this set. The pair appears 2 times in the entertainment-domain sample sentences of Table 1, and there are 3 entertainment-domain samples in total, so the frequency of the pair in the entertainment-domain sample sentences is 0.667. If the preset frequency value is set to 0.6, the phrase pair ("actor", "ceremony"), whose frequency exceeds the preset value, can be selected for domain identification of new sentences: if a new sentence contains the selected pair ("actor", "ceremony"), the new sentence is labeled as belonging to the entertainment domain. Using such high-frequency domain phrase pairs from the sample sentences to identify the domain of new sentences effectively improves identification accuracy and efficiency.
The preset value chosen in the embodiments of the present disclosure for the frequency with which a phrase pair of the domain phrase set appears in the domain's sample sentences is merely an example; those skilled in the art may choose other suitable preset values as needed, and the present disclosure does not limit this.
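The phrase-pair selection and domain labeling described above can be sketched as follows; the 0.6 threshold follows the example, and the helper names are illustrative only.

    from itertools import combinations

    def frequent_phrase_pairs(domain_sentences, domain_phrases, min_freq=0.6):
        # Keep phrase pairs that co-occur in a large enough fraction of the domain's samples.
        pairs = {}
        for a, b in combinations(sorted(domain_phrases), 2):
            hits = sum(1 for sent in domain_sentences if a in sent and b in sent)
            freq = hits / len(domain_sentences)
            if freq > min_freq:
                pairs[(a, b)] = freq
        return pairs

    def belongs_to_domain(sentence, selected_pairs):
        # Label a new sentence as in-domain when any selected phrase pair occurs in it.
        return any(a in sentence and b in sentence for (a, b) in selected_pairs)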
A domain phrase mining apparatus provided in an embodiment of the present disclosure is described below. FIG. 2 shows a schematic diagram of a domain phrase mining device 200 provided according to an embodiment of the present disclosure. The device specifically includes:
a vocabulary construction unit 201, configured to extract N-gram features from sample sentences carrying domain labels and to select the N-gram features whose frequency is greater than a preset value as a vocabulary;
a bag-of-words feature generation unit 202, configured to traverse the sample sentences based on the vocabulary and to generate bag-of-words features comprising the features and word frequency vectors of the features;
a ranking unit 203, configured to receive the bag-of-words features and the domain labels, rank the importance of the features in the bag-of-words features, and output the features whose importance is greater than a threshold as important phrases of the domain;
an extension unit 204, configured to extend the important phrases through sound and shape variation to generate an extended phrase set; and
a neighbor search unit 205, configured to search the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and to add the adjacent domain phrases to the extended phrase set to obtain the domain phrase set.
The domain phrase mining method and device provided by the embodiments of the present disclosure effectively extend the domain phrases based on sample sentences carrying domain labels; the mined domain phrases are highly accurate and suitable for domain identification of complex and varied web language.
Fig. 3 shows a schematic structural diagram of an electronic device 300 provided according to an embodiment of the present disclosure. As shown in fig. 3, the electronic apparatus 300 includes a Central Processing Unit (CPU) 301 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage section 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus are also stored. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary, so that a computer program read out therefrom is mounted into the storage section 308 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer-readable medium bearing instructions; in such embodiments, the instructions may be downloaded and installed from a network via the communication section 309 and/or installed from the removable medium 311. When executed by the central processing unit (CPU) 301, the instructions perform the method steps described in the present disclosure.
The above embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the scope of the present disclosure is not limited to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present disclosure and shall be construed as falling within it.

Claims (8)

1. A domain phrase mining method, comprising:
extracting N-gram features from sample sentences carrying domain labels, and selecting the N-gram features whose frequency is greater than a preset value as a vocabulary;
traversing the sample sentences based on the vocabulary to generate bag-of-words features comprising the features and word frequency vectors of the features;
inputting the bag-of-words features and the domain labels into a ranking model, ranking the importance of the features in the bag-of-words features with the ranking model, and outputting the features whose importance is greater than a threshold as important phrases of the domain;
extending the important phrases through sound and shape variation to generate an extended phrase set;
searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set, and adding the adjacent domain phrases to the extended phrase set to obtain a domain phrase set; and
calculating the frequency with which any phrase pair in the domain phrase set appears in the sample sentences of the domain, selecting the phrase pairs whose frequency exceeds a preset value, and using the selected phrase pairs to determine whether new corpus text belongs to the domain.
2. The method of claim 1, further comprising:
performing word segmentation on the sample sentences, and obtaining new words based on the segmentation; and
combining the new words with the selected N-gram features whose frequency is greater than the preset value to form the vocabulary.
3. The method of claim 2, wherein obtaining new words based on the segmentation comprises obtaining the new words through an unsupervised method.
4. The method of any of claims 1-3, wherein the feature length N of the N-gram features is 2 to 4.
5. The method of claim 1, further comprising:
merging the important phrases with existing domain phrases to obtain an initial phrase set, and extending the initial phrase set through sound and shape variation to generate the extended phrase set.
6. The method of claim 1, wherein searching the sample sentences for domain phrases adjacent to any phrase in the extended phrase set comprises:
performing word segmentation and character segmentation on the sample sentences, and generating corresponding word vectors and character vectors;
vectorizing the phrases in the extended phrase set to obtain a vector for each phrase; and
calculating the similarity between the vector of any phrase and the word vectors generated from the sample sentences, and selecting the segmented words whose similarity is greater than a preset value as the domain phrases adjacent to that phrase.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
8. A computer readable medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
CN202010957899.1A 2020-09-14 2020-09-14 Domain phrase mining method and device Active CN111814474B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010957899.1A CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010957899.1A CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Publications (2)

Publication Number Publication Date
CN111814474A CN111814474A (en) 2020-10-23
CN111814474B true CN111814474B (en) 2021-01-29

Family

ID=72860712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010957899.1A Active CN111814474B (en) 2020-09-14 2020-09-14 Domain phrase mining method and device

Country Status (1)

Country Link
CN (1) CN111814474B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818686B (en) * 2021-03-23 2023-10-31 北京百度网讯科技有限公司 Domain phrase mining method and device and electronic equipment
CN115168895B (en) * 2022-07-08 2023-12-12 深圳市芒果松科技有限公司 User information threat analysis method and server combined with artificial intelligence
CN117034917B (en) * 2023-10-08 2023-12-22 中国医学科学院医学信息研究所 English text word segmentation method, device and computer readable medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328380A1 (en) * 2014-02-22 2016-11-10 Tencent Technology (Shenzhen) Company Limited Method and apparatus for determining morpheme importance analysis model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214238B (en) * 2011-07-01 2012-10-24 临沂大学 Device and method for matching similarity of Chinese words
CN107423398B (en) * 2017-07-26 2023-04-18 腾讯科技(上海)有限公司 Interaction method, interaction device, storage medium and computer equipment
CN109408802A (en) * 2018-08-28 2019-03-01 厦门快商通信息技术有限公司 A kind of method, system and storage medium promoting sentence vector semanteme
CN109325015B (en) * 2018-08-31 2021-07-20 创新先进技术有限公司 Method and device for extracting characteristic field of domain model
CN110674252A (en) * 2019-08-26 2020-01-10 银江股份有限公司 High-precision semantic search system for judicial domain
CN110704391A (en) * 2019-09-23 2020-01-17 车智互联(北京)科技有限公司 Word stock construction method and computing device
CN110688836A (en) * 2019-09-30 2020-01-14 湖南大学 Automatic domain dictionary construction method based on supervised learning


Also Published As

Publication number Publication date
CN111814474A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN111814474B (en) Domain phrase mining method and device
CN107832414B (en) Method and device for pushing information
CN107633007B (en) Commodity comment data tagging system and method based on hierarchical AP clustering
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
US8108413B2 (en) Method and apparatus for automatically discovering features in free form heterogeneous data
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
JP2011227688A (en) Method and device for extracting relation between two entities in text corpus
CN111859961A (en) Text keyword extraction method based on improved TopicRank algorithm
Lyu et al. Joint word segmentation, pos-tagging and syntactic chunking
Jihan et al. Multi-domain aspect extraction using support vector machines
CN108763192B (en) Entity relation extraction method and device for text processing
Sousa et al. Word sense disambiguation: an evaluation study of semi-supervised approaches with word embeddings
Wang et al. Interactive Topic Model with Enhanced Interpretability.
CN111506726B (en) Short text clustering method and device based on part-of-speech coding and computer equipment
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Ghosh Sentiment analysis of IMDb movie reviews: A comparative study on performance of hyperparameter-tuned classification algorithms
CN112732863B (en) Standardized segmentation method for electronic medical records
CN111125329B (en) Text information screening method, device and equipment
CN111061939B (en) Scientific research academic news keyword matching recommendation method based on deep learning
D’hondt et al. Topic identification based on document coherence and spectral analysis
CN116933782A (en) E-commerce text keyword extraction processing method and system
Thuy et al. Leveraging foreign language labeled data for aspect-based opinion mining
CN110348497A (en) A kind of document representation method based on the building of WT-GloVe term vector
Gupta et al. Domain adaptation of information extraction models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant