CN112101020A - Method, device, equipment and storage medium for training key phrase identification model - Google Patents


Info

Publication number
CN112101020A
CN112101020A (application CN202010880346.0A)
Authority
CN
China
Prior art keywords
training
key
training data
text
phrase
Prior art date
Legal status
Granted
Application number
CN202010880346.0A
Other languages
Chinese (zh)
Other versions
CN112101020B (en)
Inventor
杨虎
汪琦
王述
张晓寒
冯知凡
柴春光
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010880346.0A
Publication of CN112101020A
Application granted
Publication of CN112101020B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/367 Ontology
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/045 Combinations of networks
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods


Abstract

The application discloses a method, apparatus, device, and storage medium for training a key phrase identification model, relating to the fields of artificial intelligence, knowledge graphs, and deep learning. The method for training the key phrase identification model comprises the following steps: acquiring first training data related to a target field, wherein key phrases related to the target field in a first training text of the first training data are identified; acquiring general training data that is not related to the target field, wherein key phrases that are not related to the target field in a general training text of the general training data are identified; and training a key phrase identification model for the target field based on the first training data and the general training data, so as to identify text to be identified that is related to the target field. In this way, an accurate key phrase identification model for the target domain may be obtained with only a small amount of identified data in that domain.

Description

Method, device, equipment and storage medium for training key phrase identification model
Technical Field
The present disclosure relates to the field of data processing, in particular, to the field of artificial intelligence, knowledge-graphs, and deep learning, and more particularly, to methods, apparatuses, devices, and storage media for training key phrase identification models.
Background
With the development of computer technology, solutions capable of processing various kinds of data based on machine learning have been developed. For example, text may be processed using machine learning techniques to identify the key phrases it contains; for a video, the title and introduction text may contain key phrases that are useful for understanding the content of the video. However, because texts belong to different domains whose features differ, identifying key phrases in texts from a particular domain may require an identification model tailored to that domain, and a large amount of manually identified data is normally required to train such a model.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for training a key phrase identification model.
According to a first aspect of the present disclosure, a method for training a key phrase identification model is provided. The method includes obtaining first training data related to a target field, wherein key phrases related to the target field in a first training text of the first training data are identified. The method also includes obtaining generic training data that is not related to the target domain, wherein key phrases in the generic training text of the generic training data that are not related to the target domain are identified. The method further comprises training a key phrase identification model for the target field based on the first training data and the general training data for identifying the text to be identified related to the target field.
According to a second aspect of the present disclosure, a method for identifying key phrases in text to be identified is provided. The method comprises the step of obtaining a text to be identified related to a target field. The method further comprises identifying key phrases in the text to be identified that are related to the target domain using a key phrase identification model trained according to the method of the first aspect of the present disclosure.
According to a third aspect of the present disclosure, an apparatus for training a key phrase identification model is provided. The apparatus includes a first training data acquisition module configured to acquire first training data related to a target field, wherein a key phrase related to the target field in a first training text of the first training data is identified. The apparatus also includes a generic training data acquisition module configured to acquire generic training data that is not related to the target domain, wherein key phrases in the generic training text of the generic training data that are not related to the target domain are identified. The device further comprises a model training module configured to train a key phrase identification model for the target field based on the first training data and the general training data for identifying a text to be identified related to the target field.
According to a fourth aspect of the present disclosure, an apparatus for identifying key phrases in text to be identified is provided. The device includes: and the text to be identified acquisition module is configured to acquire a text to be identified related to the target field. The device also includes: a to-be-identified text identification module configured to identify a key phrase in the to-be-identified text that is related to the target domain using the key phrase identification model trained according to the method of the first aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to the first aspect of the disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method according to the first aspect of the present disclosure.
According to the technology of the present application, an accurate key phrase identification model for the target field can be trained using only a small amount of identified data in the target field.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example system 100 in which various embodiments of the present disclosure can be implemented;
FIG. 2 illustrates a flow diagram of a method 200 for training a key phrase identification model in accordance with some embodiments of the present disclosure;
FIG. 3 illustrates an example environment 300 for obtaining a key phrase identification model for a target domain in accordance with some embodiments of the present disclosure;
FIG. 4 illustrates a flow diagram of a method 400 of obtaining a key phrase identification model for a target domain in accordance with some embodiments of the present disclosure;
FIG. 5 illustrates a flow diagram of a method 500 of obtaining second training data, in accordance with some embodiments of the present disclosure;
FIG. 6 illustrates an example environment 600 for training a key phrase identification model in accordance with some embodiments of the present disclosure;
FIG. 7 illustrates a schematic diagram of a method 700 for identifying key phrases in text to be identified, in accordance with some embodiments of the present disclosure;
FIG. 8 shows a schematic block diagram of an apparatus 800 for training a key phrase identification model in accordance with an embodiment of the present disclosure;
FIG. 9 shows a schematic block diagram of an apparatus 900 for training a key phrase identification model in accordance with an embodiment of the present disclosure; and
fig. 10 illustrates a block diagram of an electronic device 1000 capable of implementing multiple embodiments of the present disclosure.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are likewise omitted below for clarity and conciseness.
In describing embodiments of the present disclosure, the term "include" and its derivatives should be interpreted as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first," "second," and the like may refer to different or to the same objects. Other explicit and implicit definitions may also appear below.
Rule-based schemes for extracting key phrases transfer poorly between domains and address only a limited range of problems. When texts from a new domain are encountered, new rules must be added to match the characteristics of that domain, and the hand-written rules may contradict one another. Traditional machine-learning-based schemes require manual design of effective features, but such features are difficult to carry over to other domains. As for the training data required by model training, in some specific domains only a small amount of training data can be collected and the time cost of collection is high, so an accurate identification model cannot be trained. Moreover, even when a sufficient amount of training data can be collected, it usually must be identified manually, which is costly and, for specialized domains, requires the involvement or guidance of domain experts.
To address, at least in part, one or more of the above problems and other potential problems, embodiments of the present disclosure propose a technical solution for training a key phrase identification model. In this scheme, by training with both training data related to a target field and general training data, on the basis of a key phrase identification model for that target field, a model can be obtained that identifies the key phrases related to the target field in a text to be identified. In this way, an accurate key phrase identification model for a particular target domain may be trained using only a small amount of identified data in that domain, thereby reducing the cost of collecting and identifying data. In addition, the scheme can be applied to multiple different fields: key phrase identification models for those fields can be obtained merely by preparing the corresponding kinds of training data for each of them.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings. Herein, the term "field" (or "domain") may refer to a range of specialized activities or businesses; examples may include, but are not limited to, military, literature, art, medicine, sports, and the like. The term "model" refers to something that can learn from training data the associations between inputs and outputs, so that after training is complete it can generate a corresponding output for a given input. It should be understood that a "model" may also be referred to as a "neural network," a "learning model," or a "learning network." The term "key phrase" may sometimes be referred to as a "named entity" and may refer to one or more keywords occurring in a piece of content. Key phrases may be determined based on the user's intent; examples may include, but are not limited to, person names, place names, book names, health-related entities, medical-aesthetics-related entities, and so forth.
Fig. 1 illustrates a schematic diagram of an example system 100 in which various embodiments of the present disclosure can be implemented. System 100 may generally include a model training subsystem 110 and a model application subsystem 120. It should be understood that the description of the structure and function of the system 100 is for exemplary purposes only and does not imply any limitation as to the scope of the disclosure. Embodiments of the present disclosure may also be applied to environments with different structures and/or functions.
In the model training subsystem 110, the model training device 111 may acquire the first training data 101 and the generic training data 102. It is understood that the first training data 101 is for a target domain, while the generic training data 102 may be for various domains.
The model training device 111 may perform training using the first training data 101 and the generic training data 102, so that the trained key phrase identification model 103 for the target domain can accurately identify text related to that domain. The training on these two kinds of data may start from the initial identification model 112: before training with them, the initial identification model 112 may first be trained with training data that is related to the target domain but not identified, to obtain the key phrase identification model 103. In this way, the number of identified training samples required in the target domain is reduced, and an accurate final model for the target domain can be obtained by training on only a small number of identified training samples (i.e., the first training data 101).
In the model application subsystem 120, the model application means 121 may acquire the text to be identified 104, which is related to the target domain. After the text 104 is input into the key phrase identification model 103 for the target field and processed, the model application means 121 may output the identification result 105 for the text 104, and the identification result 105 identifies the key phrases in the text that are related to the target domain.
For ease and clarity of illustration, embodiments of the present disclosure will be described below with reference to the system 100 of fig. 1. It is to be understood that embodiments of the present disclosure may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect. For ease of understanding, specific data mentioned in the following description are exemplary and are not intended to limit the scope of the present disclosure.
FIG. 2 illustrates a flow diagram of a method 200 for training a key phrase identification model in accordance with some embodiments of the present disclosure. For example, the method 200 may be performed by the model training device 111 as shown in FIG. 1. The various actions of method 200 are described in detail below in conjunction with fig. 1. It is to be understood that method 200 may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect.
At block 202, the model training device 111 obtains first training data 101 related to the target domain, wherein key phrases related to the target domain in the first training text of the first training data 101 are identified.
Taking a military target field as an example, the first training data 101 may include a plurality of first training texts related to military affairs. A first training text may be a sentence or a paragraph related to military affairs; for example, it may be extracted from entries of encyclopedia data or a knowledge graph of the military domain, with the military-related key phrases in it identified.
In some embodiments, key phrases in text may be identified using B (begin), I (inside), and O (outside) tags, where a B tag marks the starting character of a key phrase, I tags mark the remaining characters of the key phrase, and O tags mark the other characters in the sentence that do not belong to any key phrase. Besides the first training text, the other "identified" texts referred to herein may be identified in the same way.
In some other embodiments of the present disclosure, other labels besides BIO labels may also be utilized to identify key phrases in training texts, and the scope of the present disclosure is not limited in this respect.
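The BIO tagging scheme described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and example sentence are invented for demonstration.

```python
def bio_tags(text, phrases):
    """Assign a B/I/O tag to each character of `text`.

    B marks the first character of a key phrase, I marks the remaining
    characters of that phrase, and O marks all characters outside any
    key phrase.
    """
    tags = ["O"] * len(text)
    for phrase in phrases:
        start = text.find(phrase)
        while start != -1:  # tag every occurrence of the phrase
            tags[start] = "B"
            for i in range(start + 1, start + len(phrase)):
                tags[i] = "I"
            start = text.find(phrase, start + len(phrase))
    return tags

# Tag the key phrase "M9 pistol" inside a short sentence.
sentence = "The M9 pistol is reliable."
print(list(zip(sentence, bio_tags(sentence, ["M9 pistol"]))))
```

Character-level tagging as shown matches the patent's description of tagging "characters" rather than whitespace-separated tokens, which suits Chinese text.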
In some embodiments, the first training data 101 is obtained by the following steps. First, a first training text related to the target field is acquired, and the number of the first training texts may be a small number, for example, on the order of hundreds or thousands. The first training text is then identified such that key phrases in the first training text that are related to the target domain are identified, which in some embodiments may be done manually. Next, first training data 101 may be generated based on the identified first training text.
In the above manner, the model training device 111 may obtain a plurality of training texts associated with the target domain and identified.
At block 204, the model training device 111 obtains generic training data 102 that is not related to the target domain, wherein key phrases in the generic training text of the generic training data 102 that are not related to the target domain are identified.
It is understood that, since the generic training data 102 need not be related to any specific target domain, the number of generic training texts that the model training device 111 can obtain may be much larger than the number of first training texts. In some embodiments, the generic training data 102 may reuse text that has already been identified in other key phrase identification tasks. In other embodiments, the generic training data 102 may be mined and collected as follows. First, the model training device 111 may obtain a plurality of training texts with user labels. For example, on a video website, in encyclopedia data, or on a question-and-answer website, a large amount of data (containing corresponding texts, e.g., video titles) carries labels attached by users at upload or publication time; because such labels usually reflect the user's understanding and description of the data content, they generally have high relevance to it. The model training device 111 may thus obtain a plurality of training texts associated with the data together with their labels. The model training device 111 may then identify the plurality of training texts, so that the key phrases in them are identified; this may be done with a preliminary identification model that understands the content of the training texts.
Next, the model training device 111 selects generic training texts from the plurality of identified training texts based on the user labels and the identified key phrases. For example, based on the degree of matching between the user labels and the identified key phrases, the model training device 111 may filter the training texts using information such as label usage frequency, label count, and label part-of-speech, and select the generic training texts accordingly. Finally, the model training device 111 may generate the generic training data 102 based on the generic training texts.
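The selection step above, matching user labels against model-identified key phrases, can be sketched as a simple overlap filter. This is an illustrative reading of the patent's filtering criterion, not its actual implementation; the field names, sample data, and threshold are assumptions.

```python
def select_generic_texts(samples, min_overlap=1):
    """Keep samples whose user labels overlap the key phrases found by a
    preliminary identification model.

    Each sample is a dict with "text", "user_tags" (set of labels the
    uploader attached), and "phrases" (set of key phrases the preliminary
    model identified). A sample is kept when at least `min_overlap` user
    tags agree with identified phrases, on the intuition that such
    agreement suggests the labels are genuine key phrases.
    """
    return [s["text"] for s in samples
            if len(s["user_tags"] & s["phrases"]) >= min_overlap]

samples = [
    {"text": "F-22 dogfight footage",
     "user_tags": {"F-22", "aviation"}, "phrases": {"F-22", "dogfight"}},
    {"text": "my vacation vlog",
     "user_tags": {"travel"}, "phrases": {"beach"}},
]
print(select_generic_texts(samples))  # the vlog has no tag/phrase overlap
```

A production version would also weight by the label usage frequency, label count, and part-of-speech signals the text mentions, but the overlap test captures the core matching idea.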
At block 206, the model training device 111 trains a key phrase identification model for the target domain based on the first training data 101 and the generic training data 102. The resulting key phrase identification model 103 may be used to identify text 104 to be identified that is relevant to the target domain.
The model training device 111 may train the key phrase identification model 103 for the target domain by using both the identified first training data 101 that contains a small amount of training data but is strongly related to the target domain and the identified general training data 102 that contains a large amount of training data but is weakly related to the target domain, so as to adjust the key phrase identification model 103 and update the parameters thereof, so that the model is more suitable for the target domain. The acquisition process of the key phrase identification model 103 for the target domain will be described in detail below with reference to fig. 3 and 4.
In some embodiments, the training of the key phrase identification model 103 includes iteratively performing, by the model training device 111, the following steps until the identification accuracy of the key phrase identification model 103 is above a predetermined threshold: training the key phrase identification model 103 using the first training data 101 to update it; then training the updated model using the generic training data 102 to update it again. By using both types of training data, the generic training data 102 supplements the small number of identified training samples in the first training data 101, improving the accuracy of the identification model, so that the trained model can accurately identify the text 104 to be identified, in particular text related to the target field.
It will be appreciated that the number of times and the order in which the model training device 111 uses the first training data 101 and the generic training data 102 in the above iteration may be arbitrary, as long as both are used often enough during training that the accuracy of the final key phrase identification model 103 exceeds the predetermined threshold. For example, the model training device 111 may train the key phrase identification model 103 once with the first training data 101, then once with the generic training data 102, then once again with the first training data 101, and so on; or it may train twice with the first training data 101 and once with the generic training data 102, and so on.
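The alternating scheme above can be sketched as a loop over the two data sources. The training-step and evaluation functions below are stand-ins (the accuracy nudge merely simulates a model improving); only the loop structure, alternating in-domain and generic batches until a threshold is met, reflects the text.

```python
import random

def train_step(model, batch):
    # Placeholder for one gradient update; nudging a stored score
    # simulates the model improving (an assumption for illustration).
    model["accuracy"] = min(1.0, model["accuracy"] + 0.05)

def evaluate(model):
    # Placeholder for measuring identification accuracy on held-out data.
    return model["accuracy"]

def alternating_training(model, first_data, generic_data, threshold=0.9):
    """Alternate one update on in-domain data with one update on generic
    data until identification accuracy exceeds `threshold`."""
    while evaluate(model) <= threshold:
        train_step(model, random.choice(first_data))    # small, in-domain
        train_step(model, random.choice(generic_data))  # large, generic
    return model

trained = alternating_training({"accuracy": 0.5}, ["d1"], ["g1", "g2"])
print(trained["accuracy"])
```

As the surrounding text notes, other schedules (e.g. two in-domain updates per generic update) fit the same loop by changing how many `train_step` calls each branch makes.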
Therefore, according to embodiments of the present disclosure, an accurate key phrase identification model for the target field can be trained using only a small amount of identified data in that field, reducing the cost of collecting and identifying data. Moreover, the scheme can be applied to multiple different fields: key phrase identification models for those fields can be obtained merely by preparing the corresponding kinds of training data for each of them.
FIG. 3 illustrates an example environment 300 for obtaining a key phrase identification model for a target domain in accordance with some embodiments of the present disclosure. FIG. 4 illustrates a flow diagram of a method 400 of obtaining a key phrase identification model for a target domain in accordance with some embodiments of the present disclosure. For example, method 400 may be performed by model training device 111, as shown in FIG. 1, or model training device 311, as shown in FIG. 3. The various actions of method 400 are described in detail below in conjunction with FIG. 3. It is to be understood that method 400 may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect.
At block 402, the model training device 311 obtains second training data 301 associated with the target domain, the second training text of the second training data 301 being unidentified.
In particular, the second training data 301 is related to the target domain, and the second training text in the second training data 301 may be unidentified, so using such data may reduce the cost for generating the key phrase identification model 303. In some embodiments, the second training data 301 may be obtained from a database containing text, such as a knowledge graph, encyclopedia data, question and answer website, using text mining techniques, the process of which will be described in detail below with reference to fig. 5.
At block 404, the model training device 311 trains the initial identification model 312 based on the second training data 301 to obtain the key phrase identification model 303 for the target domain.
The initial identification model 312 may be an existing generic pre-trained identification model, for example a BERT or ERNIE model, but such models may not achieve high identification accuracy on text from the target domain. Specifically, the second training data 301 may be input into the initial identification model 312, and the parameters of the initial identification model 312 may then be adjusted based on the output result to update the model. By repeating this process, a key phrase identification model 303 for the target domain can be obtained.
In this way, by training with the unidentified second training data, the key phrase identification model 303 for the target domain can be obtained as the basis for subsequent further training. Because the second training data does not need to be identified manually, and a large amount of such text is available for collection in the target domain, the cost of collecting and identifying training data can be reduced.
Fig. 5 illustrates a flow diagram of a method 500 of acquiring second training data, in accordance with some embodiments of the present disclosure. Method 500 is an exemplary detailed process of block 402 in method 400. For example, the method 500 may be performed by the model training device 111 as shown in FIG. 1. The various actions of method 500 are described in detail below in conjunction with fig. 1. It is to be understood that method 500 may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect.
At block 502, the model training device 111 determines a plurality of key phrase types associated with the target domain.
Specifically, again taking the target domain as "military" as an example, the model training device 111 may determine a plurality of key phrase types related to the military domain, including but not limited to: weapons, airplanes, missiles, tanks, firearms, pistols, and the like.
At block 504, the model training device 111 determines a plurality of key-phrase entries from the database that are associated with a plurality of key-phrase types.
In some embodiments, the model training device 111 may filter a database such as a knowledge graph, encyclopedia data, or a question-and-answer website using the plurality of key phrase types, filtering out a plurality of key phrase entries matching those types. For example, for the firearm type, the model training device 111 may determine entries such as the M9 pistol, the M14 rifle, and the M12S submachine gun associated with the firearm type.
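As an illustrative sketch only (the entry titles and type tags below are hypothetical stand-ins for a real knowledge graph or encyclopedia database), the filtering in block 504 can be pictured as a set-intersection test between each entry's type tags and the determined key phrase types:

```python
# Hypothetical in-memory stand-in for an encyclopedia database; each entry
# carries the types it is tagged with in the source database.
database = [
    {"title": "M9 pistol", "types": {"firearm", "pistol"}},
    {"title": "M14 rifle", "types": {"firearm", "rifle"}},
    {"title": "F-16", "types": {"airplane", "fighter"}},
    {"title": "Paris", "types": {"city"}},
]

def filter_entries(database, key_phrase_types):
    """Keep the entries whose type tags intersect the key phrase types."""
    return [entry for entry in database if entry["types"] & key_phrase_types]

military_types = {"weapon", "airplane", "missile", "tank", "firearm"}
matched = filter_entries(database, military_types)
print([entry["title"] for entry in matched])  # ['M9 pistol', 'M14 rifle', 'F-16']
```

A real implementation would issue queries against the database's own type index rather than scan entries in memory; the intersection test above only illustrates the matching criterion.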
At block 506, the model training device 111 obtains second training data based on the plurality of key phrase entries.
In some embodiments, the model training device 111 may extract a corresponding plurality of description texts from the plurality of key phrase entries determined in block 504 as second training texts, and generate second training data based on the second training texts. For example, when the database is encyclopedia data, the model training device 111 may extract the brief description text of each determined entry as the second training text. Again taking the military field as an example, when the determined key phrase entry is the M9 pistol, the model training device 111 can extract the descriptive text "The M9 pistol employs the barrel short-recoil principle, with a falling-block locking mechanism and a single/double action trigger design, is fed by a 15-round detachable magazine, and has a gun length of 217 mm, a weight of 1.1 kg (including the loaded magazine), and a muzzle velocity of 390 m/s. The M9 is simple in structure and reliable in mechanical action. The full gun life is greater than 5000 rounds" as the second training text.
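A minimal sketch of this step, under the assumption that each entry record carries a `description` field (the field name, records, and the skip-empty rule are illustrative, not taken from the source):

```python
# Hypothetical entry records; in a real system the description would be the
# brief introduction text stored with the encyclopedia entry.
entries = [
    {"title": "M9 pistol", "description": "The M9 pistol employs the barrel short-recoil principle."},
    {"title": "M14 rifle", "description": "The M14 is a selective-fire rifle."},
    {"title": "Stub entry", "description": ""},
]

def build_second_training_data(entries):
    """Use each entry's description text as an (unidentified) second training
    text, skipping entries without a usable description."""
    return [{"text": e["description"]} for e in entries if e["description"]]

second_training_data = build_second_training_data(entries)
print(len(second_training_data))  # 2
```

Note that the resulting records carry no key phrase labels at all; that is precisely what makes them cheap second training data.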
Since all key phrase types for the target domain may not be determined at once, in some embodiments the step of determining a plurality of key phrase types in block 502 may specifically include the following steps.
First, the model training device 111 may determine candidate key phrase types associated with the target domain; for example, for the military domain, 5 candidate key phrase types may be determined, such as weapons, airplanes, missiles, tanks, and firearms. Then, using steps similar to those in block 504, the model training device 111 may determine candidate key phrase entries from the database that are associated with the candidate key phrase types; for example, for the firearm type, the model training device 111 may determine entries such as the M9 pistol, the M14 rifle, and the M12S submachine gun. Based on the candidate key phrase entries, the model training device 111 may determine extended key phrase types associated with the target domain. For example, for the entry M9 pistol belonging to the firearm type, it may be determined that the M9 pistol also belongs to the "pistol" type; the model training device 111 may then count the total number of entries among the candidate key phrase entries belonging to the "pistol" type, and may treat "pistol" as an extended key phrase type if the total number of entries is greater than a predetermined threshold. In this manner, several extended key phrase types may be determined based on the candidate key phrase entries. The model training device 111 may then determine the plurality of key phrase types based on the candidate key phrase types and the extended key phrase types, and further perform the steps described in blocks 504 and 506 on the database to augment the second training data based on the plurality of key phrase types. In some embodiments, the plurality of key phrase types may include the candidate key phrase types and some of the extended key phrase types.
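The threshold-based type expansion described above can be sketched as follows; the entry records, the candidate types, and the threshold value are all hypothetical illustrations:

```python
from collections import Counter

def extend_key_phrase_types(candidate_entries, candidate_types, threshold=1):
    """Count the additional types carried by the candidate entries and promote
    any type whose entry count is greater than the predetermined threshold."""
    counts = Counter(
        t
        for entry in candidate_entries
        for t in entry["types"]
        if t not in candidate_types
    )
    extended = {t for t, n in counts.items() if n > threshold}
    return candidate_types | extended

candidate_types = {"weapon", "airplane", "missile", "tank", "firearm"}
candidate_entries = [
    {"title": "M9 pistol", "types": {"firearm", "pistol"}},
    {"title": "M1911 pistol", "types": {"firearm", "pistol"}},
    {"title": "M14 rifle", "types": {"firearm", "rifle"}},
]
expanded = extend_key_phrase_types(candidate_entries, candidate_types)
# "pistol" appears in two entries (greater than the threshold) and is
# promoted; "rifle" appears only once and is not.
print("pistol" in expanded, "rifle" in expanded)  # True False
```

Repeating this expansion on the enlarged type set would correspond to the repeated execution of the process described below.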
The model training device 111 may filter the extended key phrase types through various rules to determine the above-mentioned subset of key phrase types; example criteria include relevance to the military field and/or overlap with existing candidate key phrase types.
It will be appreciated that the above process may be repeatedly performed to obtain a large amount of second training data related to the target domain, so as to cover as many key phrase types in the target domain as possible. The above process may also be performed for multiple different domains to obtain second training data for each of those domains.
In this way, a large amount of text related to the target domain can be obtained from the existing database as the second training data for training the initial identification model 112 to obtain the key phrase identification model 103 for the target domain.
FIG. 6 illustrates an example environment 600 for training a key phrase identification model in accordance with some embodiments of the present disclosure. The specific process of training in block 206 is described below in conjunction with fig. 6.
The first training data 601 and the generic training data 602 may each be used to train the key phrase identification model for the target domain. An example of the training process using the first training text in the first training data 601 is described below.
First, the first training data 601 is input to the model training device 611. The model training device 611 may pre-process the first training text in the first training data 601 to divide the sentences in the first training text into a plurality of characters or a plurality of words. The first training text, containing a plurality of characters, may then be input into the key phrase identification model 603 for the target domain to generate first vectors for the plurality of characters. These first vectors are then input into a recurrent neural network 604 for processing to generate second vectors; examples of the recurrent neural network 604 include, but are not limited to, a unidirectional LSTM (long short-term memory network) and a bidirectional LSTM. The second vectors are then processed using a model 605, such as a conditional random field, to generate labels (e.g., BIO labels) for the plurality of characters. As described with respect to block 202 of fig. 2, since the first training text in the first training data has been identified, these labels may be compared with the labels of the first training data. Based on the comparison, the parameters may then be adjusted to update the key phrase identification model 603 for the target domain, and so on. It is understood that the training process of the model training device 611 using the training text in the generic training data 602 is similar and will not be described in detail herein.
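Since the first training text has been identified, the gold BIO labels against which the model output is compared can be derived from the annotated key phrase spans. The helper below is a minimal sketch of that character-level labeling (the function name and single-phrase assumption are illustrative; the actual BERT/LSTM/CRF stack is not reproduced here):

```python
def bio_labels(chars, key_phrase):
    """Label each character: B begins a key phrase, I continues it, O otherwise."""
    labels = ["O"] * len(chars)
    text = "".join(chars)
    start = text.find(key_phrase)
    if start != -1:
        labels[start] = "B"
        for i in range(start + 1, start + len(key_phrase)):
            labels[i] = "I"
    return labels

# The sentence "M9手枪结构简单" ("The M9 pistol is simple in structure"),
# annotated with the key phrase "M9手枪" ("M9 pistol").
chars = list("M9手枪结构简单")
print(bio_labels(chars, "M9手枪"))  # ['B', 'I', 'I', 'I', 'O', 'O', 'O', 'O']
```

Comparing such gold label sequences against the labels emitted by the conditional random field yields the training signal used to adjust the model parameters.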
FIG. 7 illustrates a schematic diagram of a method 700 for identifying key phrases in text to be identified, in accordance with some embodiments of the present disclosure. For example, method 700 may be performed by the model application means 121 as shown in FIG. 1. The various actions of method 700 are described in detail below in conjunction with fig. 1. It is to be understood that method 700 may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect.
At block 702, the model application 121 obtains the text 104 to be identified that is related to the target domain. For example, the text to be identified 104 may include any video title text, video description text, or user input text to be identified, among others.
At block 704, the model application means 121 identifies key phrases in the text 104 to be identified that are related to the target domain using the trained key phrase identification model 103.
In some embodiments, the model application means 121 may split the text to be identified into one or more sentences. The model application means 121 may then utilize the key phrase identification model 103 to determine respective labels (e.g., the BIO labels discussed above) for the characters therein, and identify the key phrases in each sentence based on those labels.
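The final decoding step, collecting labeled characters back into phrases, can be sketched as follows (a hypothetical helper; a production decoder would also handle label sequences produced jointly with the CRF):

```python
def extract_key_phrases(chars, labels):
    """Collect each maximal B(I...) span of characters as one key phrase."""
    phrases, current = [], []
    for ch, label in zip(chars, labels):
        if label == "B":
            if current:
                phrases.append("".join(current))
            current = [ch]
        elif label == "I" and current:
            current.append(ch)
        else:
            if current:
                phrases.append("".join(current))
            current = []
    if current:
        phrases.append("".join(current))
    return phrases

chars = list("M9手枪结构简单")
labels = ["B", "I", "I", "I", "O", "O", "O", "O"]
print(extract_key_phrases(chars, labels))  # ['M9手枪']
```

Each "B" opens a new span and each following "I" extends it, so adjacent key phrases with no "O" between them are still separated correctly.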
FIG. 8 shows a schematic block diagram of an apparatus 800 for training a key phrase identification model in accordance with an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 may include a first training data acquisition module 802 configured to acquire first training data related to a target field, wherein a key phrase related to the target field in a first training text of the first training data is identified. The apparatus 800 may further include a generic training data acquisition module 804 configured to acquire generic training data that is not related to the target domain, wherein key phrases in the generic training text of the generic training data that are not related to the target domain are identified. The apparatus 800 may further include a model training module 806 configured to train a key phrase identification model for the target domain based on the first training data and the generic training data for identifying text to be identified that is related to the target domain.
In some embodiments, the apparatus 800 further comprises: a second training data acquisition module configured to acquire second training data related to the target field, a second training text of the second training data not being identified; and an initial model training module configured to train the initial identification model based on the second training data to obtain a key phrase identification model for the target domain.
In some embodiments, the second training data acquisition module further comprises: a key phrase type determination module configured to determine a plurality of key phrase types associated with a target domain; a key-phrase entry determination module configured to determine a plurality of key-phrase entries associated with a plurality of key-phrase types from a database; and wherein the second training data acquisition module is configured to acquire second training data based on the plurality of key phrase entries.
In some embodiments, the key phrase type determination module further comprises: a candidate key-phrase type determination module configured to determine a candidate key-phrase type associated with the target domain; a candidate key-phrase entry determination module configured to determine candidate key-phrase entries associated with a candidate key-phrase type from a database; an extended key phrase type determination module configured to determine an extended key phrase type associated with the target domain based on the candidate key phrase entries; and wherein the key phrase type determination module is configured to determine a plurality of key phrase types based on the candidate key phrase type and the extended key phrase type.
In some embodiments, the second training data acquisition module further comprises: a description text extraction module configured to extract a plurality of corresponding description texts from the plurality of key phrase entries as second training texts; and a second training data generation module configured to generate second training data based on the second training text.
In some embodiments, the first training data acquisition module 802 further comprises: a first training text acquisition module configured to acquire a first training text related to the target field; a first training text identification module configured to identify the first training text so that key phrases in the first training text related to the target field are identified; and a first training data generation module configured to generate the first training data based on the identified first training text.
In some embodiments, the generic training data acquisition module 804 further comprises: a general training text acquisition module configured to acquire a plurality of training texts having user labels; a generic training text identification module configured to identify a plurality of training texts such that a corresponding plurality of key phrases in the plurality of training texts are identified; a generic training text selection module configured to select a generic training text from the identified plurality of training texts based on the user tags and the plurality of key phrases; and a generic training text generation module configured to generate generic training data based on the generic training text.
In some embodiments, model training module 806 further comprises: a first training module configured to train the key phrase identification model with the first training data to update the key phrase identification model; and a second training module configured to train the updated key phrase identification model with the generic training data to update the key phrase identification model again; wherein the model training module 806 is configured to cause the first training module and the second training module to run iteratively.
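The iterative alternation between the two training modules amounts to a simple control loop, sketched below; `train_step` is a hypothetical placeholder for a full fine-tuning pass over one dataset:

```python
def train_alternating(model, first_data, generic_data, rounds, train_step):
    """Alternate: update the model on domain data, then on generic data."""
    for _ in range(rounds):
        model = train_step(model, first_data)    # first training module
        model = train_step(model, generic_data)  # second training module
    return model

# Toy stand-in: the "model" is just a log of which datasets it saw, so the
# alternation order is visible in the result.
log = train_alternating(
    model=[],
    first_data="first",
    generic_data="generic",
    rounds=2,
    train_step=lambda m, d: m + [d],
)
print(log)  # ['first', 'generic', 'first', 'generic']
```

Alternating the two datasets, rather than training on their concatenation, keeps the model anchored to the target domain while still benefiting from the generic data.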
FIG. 9 shows a schematic block diagram of an apparatus 900 for identifying key phrases in text to be identified in accordance with an embodiment of the present disclosure. As shown in fig. 9, the apparatus 900 may include a text to be identified acquisition module 902 configured to acquire a text to be identified related to a target domain. The apparatus 900 may further include a text to be identified identification module 904 configured to identify, using the trained key phrase identification model, key phrases related to the target domain in the text to be identified.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 10 shows a block diagram of an electronic device 1000 for training a key phrase identification model according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 10, the electronic device includes: one or more processors 1001, a memory 1002, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 10, one processor 1001 is taken as an example.
The memory 1002 is a non-transitory computer readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the method of training a key phrase identification model provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of training a key phrase identification model provided herein.
The memory 1002, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for training a key phrase identification model in the embodiments of the present application (e.g., the first training data acquisition module 802, the general training data acquisition module 804, and the model training module 806 shown in fig. 8). The processor 1001 executes various functional applications of the server and data processing, i.e., a method of training the key phrase identification model in the above method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 1002.
The memory 1002 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the stored data area may store data created from use of the electronic device that trains the key phrase identification model, and the like. Further, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 1002 may optionally include memory located remotely from the processor 1001, which may be connected to an electronic device that trains the key phrase identification model over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method of training a key phrase identification model may further comprise: an input device 1003 and an output device 1004. The processor 1001, the memory 1002, the input device 1003, and the output device 1004 may be connected by a bus or other means, and the bus connection is exemplified in fig. 10.
The input device 1003 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device for training the key phrase identification model, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 1004 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
By training, on the basis of the key phrase identification model for the target domain, with both the training data related to the target domain and the generic training data, a model capable of identifying key phrases related to the target domain in the text to be identified can be obtained. In this way, a key phrase identification model that is accurate for a particular target domain may be trained using only a small amount of identified data in that domain, thereby reducing the cost of collecting and identifying data. In addition, the scheme can be applied to multiple different fields; key phrase identification models for those fields can be obtained simply by preparing corresponding training data for each field.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A method of training a key phrase identification model, comprising:
acquiring first training data related to a target field, wherein a key phrase related to the target field in a first training text of the first training data is identified;
obtaining general training data which is not related to the target field, wherein key phrases which are not related to the target field in general training texts of the general training data are identified; and
training a key phrase identification model for the target field based on the first training data and the general training data for identifying a text to be identified related to the target field.
2. The method of claim 1, further comprising:
acquiring second training data related to the target field, wherein a second training text of the second training data is not identified; and
training an initial identification model based on second training data to obtain the key phrase identification model for the target domain.
3. The method of claim 2, wherein acquiring the second training data comprises:
determining a plurality of key phrase types associated with the target domain;
determining a plurality of key-phrase entries associated with the plurality of key-phrase types from a database; and
obtaining the second training data based on the plurality of key phrase entries.
4. The method of claim 3, wherein determining the plurality of key phrase types comprises:
determining a candidate key-phrase type associated with the target domain;
determining candidate key-phrase entries associated with the candidate key-phrase types from the database;
determining an extended key-phrase type associated with the target domain based on the candidate key-phrase entries; and
determining the plurality of key-phrase types based on the candidate key-phrase type and the extended key-phrase type.
5. The method of claim 3, wherein obtaining the second training data based on the plurality of key phrase entries comprises:
extracting a plurality of corresponding description texts from the plurality of key phrase entries to serve as the second training texts; and
generating the second training data based on the second training text.
6. The method of claim 1, wherein acquiring the first training data comprises:
acquiring the first training text related to the target field;
identifying the first training text so that key phrases related to the target field in the first training text are identified; and
generating the first training data based on the identified first training text.
7. The method of claim 1, wherein obtaining the generic training data comprises:
acquiring a plurality of training texts with user labels;
identifying the plurality of training texts such that a corresponding plurality of key phrases in the plurality of training texts are identified;
selecting the generic training text from the identified plurality of training texts based on the user labels and the plurality of key phrases; and
generating the generic training data based on the generic training text.
8. The method of claim 1, wherein training the key phrase identification model comprises iteratively performing the following:
training the key phrase identification model with the first training data to update the key phrase identification model; and
training the updated key phrase identification model with the generic training data to update the key phrase identification model again.
9. A method for identifying key phrases in text to be identified, comprising:
acquiring a text to be identified related to a target field; and
identifying key-phrases in the text to be identified that are related to the target domain using the key-phrase identification model trained using the method of any one of claims 1 to 8.
10. An apparatus for training a key phrase identification model, comprising:
a first training data acquisition module configured to acquire first training data related to a target field, wherein a key phrase related to the target field in a first training text of the first training data is identified;
a general training data acquisition module configured to acquire general training data that is not related to the target field, wherein key phrases in general training texts of the general training data that are not related to the target field are identified; and
a model training module configured to train a key phrase identification model for the target field based on the first training data and the general training data, for identifying a text to be identified related to the target field.
11. The apparatus of claim 10, further comprising:
a second training data acquisition module configured to acquire second training data related to the target field, a second training text of the second training data being unidentified; and
an initial model training module configured to train an initial identification model based on second training data to obtain the key phrase identification model for the target domain.
12. The apparatus of claim 11, wherein the second training data acquisition module further comprises:
a key phrase type determination module configured to determine a plurality of key phrase types associated with the target domain;
a key-phrase entry determination module configured to determine a plurality of key-phrase entries associated with the plurality of key-phrase types from a database; and is
Wherein the second training data acquisition module is configured to acquire the second training data based on the plurality of key phrase entries.
13. The apparatus of claim 12, wherein the key phrase type determination module further comprises:
a candidate key-phrase type determination module configured to determine a candidate key-phrase type associated with the target domain;
a candidate key-phrase entry determination module configured to determine candidate key-phrase entries associated with the candidate key-phrase type from the database;
an extended key phrase type determination module configured to determine an extended key phrase type associated with the target domain based on the candidate key phrase entry; and is
Wherein the key phrase type determination module is configured to determine the plurality of key phrase types based on the candidate key phrase types and the extended key phrase types.
14. The apparatus of claim 12, wherein the second training data acquisition module further comprises:
a description text extraction module configured to extract a plurality of corresponding description texts from the plurality of key phrase entries as the second training texts; and
a second training data generation module configured to generate the second training data based on the second training text.
15. The apparatus of claim 10, wherein the first training data acquisition module further comprises:
a first training text acquisition module configured to acquire the first training text related to the target field;
a first training text identification module configured to identify the first training text so that key phrases in the first training text that are related to the target field are identified; and
a first training text generation module configured to generate the first training data based on the identified first training text.
16. The apparatus of claim 10, wherein the generic training data acquisition module further comprises:
a general training text acquisition module configured to acquire a plurality of training texts having user labels;
a generic training text identification module configured to identify the plurality of training texts such that a corresponding plurality of key phrases in the plurality of training texts are identified;
a generic training text selection module configured to select the generic training text from the identified plurality of training texts based on the user labels and the plurality of key phrases; and
a generic training text generation module configured to generate the generic training data based on the generic training text.
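Claim 16 selects generic training texts by combining user labels with the key phrases identified in each text. One hedged interpretation is a consistency filter: keep a user-labelled text only when the phrases identified in it overlap the user's labels, discarding noisy annotations. All names below are hypothetical:

```python
def select_generic_training_texts(labelled_texts, identify):
    """Keep texts whose identified key phrases agree with the user labels.

    `labelled_texts` is a list of (text, user_labels) pairs; `identify`
    stands in for the identification step of the claim.
    """
    selected = []
    for text, user_labels in labelled_texts:
        phrases = identify(text)
        # Agreement between identification and user labels is the
        # (assumed) selection criterion.
        if set(phrases) & set(user_labels):
            selected.append(text)
    return selected


texts = [("graph neural networks excel", {"graph neural networks"}),
         ("no useful phrase here", {"quantum"})]
identify = lambda t: [p for p in ("graph neural networks", "quantum") if p in t]
chosen = select_generic_training_texts(texts, identify)
# Only the first text survives the consistency filter.
```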
17. The apparatus of claim 10, wherein the model training module further comprises:
a first training module configured to train the key phrase identification model with the first training data to update the key phrase identification model; and
a second training module configured to train the updated key phrase identification model with the common training data to update the key phrase identification model again;
wherein the first training module and the second training module run iteratively.
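The iterative scheme of claim 17 alternates two fine-tuning passes, first on domain-specific data and then on generic data, for some number of rounds. A toy sketch of that control flow (the `fine_tune` body is a counting stand-in for real gradient updates; every name is illustrative):

```python
def fine_tune(model, data):
    # Toy stand-in for one fine-tuning pass: record how many update
    # rounds the model has seen and how many examples were consumed.
    model = dict(model)
    model["updates"] = model.get("updates", 0) + 1
    model["examples_seen"] = model.get("examples_seen", 0) + len(data)
    return model


def train_iteratively(model, first_data, generic_data, num_rounds=3):
    """Claim 17's loop: the first and second training modules run iteratively."""
    for _ in range(num_rounds):
        model = fine_tune(model, first_data)    # first training module
        model = fine_tune(model, generic_data)  # second training module
    return model


m = train_iteratively({}, first_data=[1, 2], generic_data=[3, 4, 5])
# 3 rounds x 2 passes = 6 updates; 3 x (2 + 3) = 15 examples seen.
```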
18. An apparatus for identifying key phrases in text to be identified, comprising:
a text to be identified acquisition module configured to acquire a text to be identified related to the target field; and
a text to be identified identification module configured to identify key phrases in the text to be identified that are related to the target field using the key phrase identification model trained according to the method of any one of claims 1 to 8.
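At inference time (claim 18), the trained model marks spans of the input text that are key phrases of the target field. As a minimal stand-in, a phrase list plays the role of the trained model and substring matching plays the role of identification; the real method uses the trained key phrase identification model, and all names here are hypothetical:

```python
def identify_key_phrases(text, learned_phrases):
    """Return (phrase, start, end) spans for each learned phrase found in text."""
    found = []
    for phrase in learned_phrases:
        start = text.find(phrase)
        if start != -1:
            found.append((phrase, start, start + len(phrase)))
    # Report spans in order of appearance.
    return sorted(found, key=lambda t: t[1])


spans = identify_key_phrases("deep learning models for text mining",
                             ["deep learning", "text mining"])
```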
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
CN202010880346.0A 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model Active CN112101020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880346.0A CN112101020B (en) 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model

Publications (2)

Publication Number Publication Date
CN112101020A (en) 2020-12-18
CN112101020B (en) 2023-08-04

Family

ID=73758091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880346.0A Active CN112101020B (en) 2020-08-27 2020-08-27 Method, apparatus, device and storage medium for training key phrase identification model

Country Status (1)

Country Link
CN (1) CN112101020B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101851791B1 (en) * 2017-12-22 2018-04-24 주식회사 마인드셋 Apparatus and method for computing domain diversity using domain-specific terms and high frequency general terms
CN109241330A (en) * 2018-08-20 2019-01-18 北京百度网讯科技有限公司 The method, apparatus, equipment and medium of key phrase in audio for identification
CN110162681A (en) * 2018-10-08 2019-08-23 腾讯科技(深圳)有限公司 Text identification, text handling method, device, computer equipment and storage medium
CN111104526A (en) * 2019-11-21 2020-05-05 新华智云科技有限公司 Financial label extraction method and system based on keyword semantics
CN111178045A (en) * 2019-10-14 2020-05-19 深圳软通动力信息技术有限公司 Automatic construction method of non-supervised Chinese semantic concept dictionary based on field, electronic equipment and storage medium
CN111177368A (en) * 2018-11-13 2020-05-19 国际商业机器公司 Tagging training set data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姚亮; 洪宇; 刘昊; 刘乐; 姚建民: "Domain Adaptation for Translation Models Based on Semantic Distribution Similarity", Journal of Shandong University (Natural Science), no. 07 *
徐树奎; 曹劲然: "A Military Target Entity Recognition Method Based on a Hierarchical Bi-LSTM-CRF Model", Informatization Research, no. 06 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819099A (en) * 2021-02-26 2021-05-18 网易(杭州)网络有限公司 Network model training method, data processing method, device, medium and equipment
CN112819099B (en) * 2021-02-26 2023-12-22 杭州网易智企科技有限公司 Training method, data processing method, device, medium and equipment for network model
CN112966513A (en) * 2021-03-05 2021-06-15 北京百度网讯科技有限公司 Method and apparatus for entity linking
CN112966513B (en) * 2021-03-05 2023-08-01 北京百度网讯科技有限公司 Method and apparatus for entity linking

Also Published As

Publication number Publication date
CN112101020B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN111709247B (en) Data set processing method and device, electronic equipment and storage medium
CN111221983B (en) Time sequence knowledge graph generation method, device, equipment and medium
CN110991196B (en) Translation method and device for polysemous words, electronic equipment and medium
CN111639710A (en) Image recognition model training method, device, equipment and storage medium
CN111104514B (en) Training method and device for document tag model
CN110955764B (en) Scene knowledge graph generation method, man-machine conversation method and related equipment
CN111428507A (en) Entity chain finger method, device, equipment and storage medium
CN111125335A (en) Question and answer processing method and device, electronic equipment and storage medium
CN111967262A (en) Method and device for determining entity tag
CN111737994A (en) Method, device and equipment for obtaining word vector based on language model and storage medium
CN111274407B (en) Method and device for calculating triplet confidence in knowledge graph
CN111950291A (en) Semantic representation model generation method and device, electronic equipment and storage medium
CN112101020B (en) Method, apparatus, device and storage medium for training key phrase identification model
CN111859997A (en) Model training method and device in machine translation, electronic equipment and storage medium
CN111539224B (en) Pruning method and device of semantic understanding model, electronic equipment and storage medium
CN111079945A (en) End-to-end model training method and device
CN112380847A (en) Interest point processing method and device, electronic equipment and storage medium
CN111984774B (en) Searching method, searching device, searching equipment and storage medium
CN111090991A (en) Scene error correction method and device, electronic equipment and storage medium
CN111831814A (en) Pre-training method and device of abstract generation model, electronic equipment and storage medium
CN111291192B (en) Method and device for calculating triplet confidence in knowledge graph
CN111966782A (en) Retrieval method and device for multi-turn conversations, storage medium and electronic equipment
CN112084150A (en) Model training method, data retrieval method, device, equipment and storage medium
CN111753964A (en) Neural network training method and device
CN111553169B (en) Pruning method and device of semantic understanding model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant