WO2020082560A1 - Method, apparatus and device for extracting text keyword, as well as computer readable storage medium - Google Patents

Method, apparatus and device for extracting text keyword, as well as computer readable storage medium Download PDF

Info

Publication number
WO2020082560A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
keyword
text
word
preset
Prior art date
Application number
PCT/CN2018/122813
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
肖京
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020082560A1 publication Critical patent/WO2020082560A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The present application relates to the field of keyword extraction technology, and in particular to a text keyword extraction method, apparatus, device, and computer-readable storage medium.
  • Keyword extraction is widely used in many fields of text processing, such as text clustering, text summarization, and information retrieval.
  • In the current era of big data, keyword extraction plays an important role in the NLP field, providing a cornerstone for hot topics such as sentiment analysis, semantic analysis, and knowledge graphs.
  • At present, the mainstream methods in this field include keyword extraction based on latent topic models (LDA), keyword extraction based on TF-IDF word frequency statistics, and keyword extraction based on word graph models (TextRank).
  • LDA hidden topic model
  • TextRank word graph model
  • This application proposes a new keyword extraction method.
  • The main purpose of the present application is to provide a text keyword extraction method, which aims to solve the technical problem of low efficiency in existing text keyword extraction.
  • the present application provides a text keyword extraction method, wherein the text keyword extraction method includes the following steps:
  • the target keyword vector is converted into a corresponding target keyword, and the target keyword is extracted as a text keyword of the text to be extracted.
  • the present application also provides a text keyword extraction device, the text keyword extraction device includes:
  • The first vector conversion module is configured to obtain the text to be extracted and convert the text to be extracted into a corresponding word vector group according to a preset word vector library;
  • the keyword generation module is configured to extract the target keyword vector from the word vector group according to the preset optimal generation model;
  • the second vector conversion module is configured to convert the target keyword vector into the corresponding target keyword according to the preset word vector library, and to extract the target keyword as a text keyword of the text to be extracted.
  • the present application also provides a text keyword extraction device.
  • The text keyword extraction device includes a processor, a memory, and computer-readable instructions stored on the memory and executable by the processor, where the computer-readable instructions, when executed by the processor, implement the steps of the text keyword extraction method described above.
  • The present application also provides a computer-readable storage medium having computer-readable instructions stored thereon, where the computer-readable instructions, when executed by a processor, implement the steps of the text keyword extraction method described above.
  • the text to be extracted is converted into a corresponding word vector group according to a preset word vector library;
  • The target keyword vector is extracted from the word vector group according to a preset optimal generation model; that is, by converting the text to be extracted into vectorized data and using it as the input of the generation model, the amount of model computation can be reduced and the efficiency of text keyword extraction improved;
  • the target keyword vector is converted into a corresponding target keyword, and the target keyword is extracted as the text keyword of the text to be extracted, realizing the extraction of the text keywords of the text to be extracted.
  • FIG. 1 is a schematic structural diagram of a text keyword extraction device of a hardware operating environment involved in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a first embodiment of a text keyword extraction method of this application
  • FIG. 3 is a schematic flowchart of a second embodiment of a text keyword extraction method of this application.
  • FIG. 4 is a schematic diagram of function modules of the first embodiment of the text keyword extraction device of the present application.
  • FIG. 1 is a schematic diagram of a hardware structure of a text keyword extraction device provided by this application.
  • the text keyword extraction device may be a PC, or a device with a display function such as a smart phone, tablet computer, portable computer, desktop computer, etc.
  • The text keyword extraction device may be a server device on which a text keyword extraction back-end management system exists, through which users manage the text keyword extraction device.
  • the text keyword extraction device may include components such as a processor 101 and a memory 201.
  • the processor 101 is connected to the memory 201, and the memory 201 stores computer-readable instructions.
  • The processor 101 can call the computer-readable instructions stored in the memory 201 and implement the steps of each embodiment of the text keyword extraction method described below.
  • the memory 201 can be used to store software programs and various data.
  • The memory 201 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system and application programs (such as computer-readable instructions) required for at least one function, and the storage data area may include a database, such as node information of nodes in an associated network.
  • the memory 201 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • The processor 101 is the control center of the text keyword extraction device and uses various interfaces and lines to connect the various parts of the entire text keyword extraction device. By running or executing the software programs and/or modules stored in the memory 201 and calling the data stored in the memory 201, it performs the various functions of the text keyword extraction device and processes data, thereby monitoring the text keyword extraction device as a whole.
  • The processor 101 may include one or more processing units; optionally, the processor 101 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, the user interface, and application programs.
  • The modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 101.
  • The structure of the text keyword extraction device shown in FIG. 1 does not constitute a limitation on the text keyword extraction device, which may include more or fewer components than illustrated, combine certain components, or use different component arrangements.
  • the “extraction device” in the following is an abbreviation of the text keyword extraction device.
  • This application provides a text keyword extraction method.
  • FIG. 2 is a schematic flowchart of a first embodiment of a text keyword extraction method of the present application.
  • the text keyword extraction method includes the following steps:
  • Step S10 Obtain the text to be extracted, and convert the text to be extracted into a corresponding word vector group according to a preset word vector library;
  • the text to be extracted refers to the text data to be extracted with keywords.
  • the text to be extracted is a character string composed of multiple characters in a specific semantic order.
  • the text to be extracted may be an article or a piece of text.
  • the extraction device may provide an input interface to acquire text data that a user needs to extract keywords through the input interface, and use the acquired text data as the text to be extracted.
  • the extraction device may also receive text data sent by other devices that requires keyword extraction, and use the received text data as text to be extracted.
  • the extraction device may also provide a selectable text list to obtain the text to be extracted selected by the user from the selectable text list.
  • The preset word vector library stores preset corpus words and their corresponding word vectors.
  • A word vector is a vector into which a word is mapped as real numbers.
  • For example, the text form "microphone" may be expressed in the mathematical form "[0 0 0 1 0 0 0 0 0 0 0 ...]"; in that case, "[0 0 0 1 0 0 0 0 0 0 0 ...]" is the word vector of "microphone". It can be understood that there is no limitation on what representation the textual corpus words are converted into, as long as the corpus words can be expressed mathematically.
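As a minimal illustration of the mapping described above (not taken from the patent itself), a small vocabulary can be mapped to one-hot word vectors; the vocabulary and indices below are assumptions chosen for the example.

```python
import numpy as np

# Hypothetical mini-vocabulary; a word's index determines where the 1 appears.
vocab = ["weather", "today", "very", "microphone", "happy"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_word_vector(word):
    """Express a textual corpus word in a mathematical (one-hot) form."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot_word_vector("microphone"))  # [0. 0. 0. 1. 0.]
```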
  • a preset word vector library needs to be established, specifically, including:
  • Step S11 Obtain the corpus text to be converted, and segment the corpus text to obtain the phrase to be converted after the word segmentation;
  • The corpus text to be converted is the corpus text on which vector conversion is to be performed.
  • the extraction device can pull the corpus text directly from the Internet, such as news or articles, etc.
  • the corpus text can also be obtained from the corpus.
  • The phrase to be converted refers to the group of words, obtained by segmenting the corpus text, that make up the corpus text.
  • the “phrase” in this embodiment refers to multiple words, and the phrase to be converted includes multiple words to be converted.
  • Word segmentation is the operation of dividing a continuous character sequence into multiple individual characters or character sequences.
  • the extraction device can then segment the corpus text according to punctuation marks to obtain several sentences, and then segment the sentences to obtain the words that make up the corpus text.
  • The extraction device can use a preset word segmentation method to perform word segmentation and obtain multiple characters (unordered phrases) or character sequences (phrases with a specific arrangement order, such as phrases in the same order as the corpus text). The extraction device can then determine, according to a vocabulary, the part of speech of each word to be converted in the phrase obtained after word segmentation, and can also count the word length of each word.
  • Part of speech is data that reflects the type of content of the word, and includes 12 categories such as adjectives, prepositions, predicates, and nouns.
  • the word length is the number of characters contained in the word.
  • the preset word segmentation method may be a word segmentation method based on character matching, semantic understanding, or statistics.
  • the extraction device may set the word length threshold of each word to be converted obtained by word segmentation, so that the word length of each word to be converted obtained by word segmentation does not exceed the word length threshold.
  • For example, the extraction device determines the part of speech of each word in the word sequence "I / today / very / happy" and obtains "黎明 a / today b / very c / happy d", where a indicates a person's name, b indicates an adverbial, c indicates an adverb, and d indicates a predicate.
  • The extraction device determines the word length of each word in the word sequence "I / today / very / happy" and obtains "I 1 / today 2 / very 1 / happy 2", where the number indicates the word length.
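The segmentation, part-of-speech and word-length bookkeeping described above can be sketched with an off-the-shelf Chinese segmenter. The patent does not name a tool; jieba is an assumed choice here, and its tag set differs from the 12 part-of-speech categories mentioned above.

```python
import jieba.posseg as pseg  # assumed segmenter (pip install jieba)

sentence = "我今天很高兴"  # "I / today / very / happy"

# For each word to be converted, record its content, part-of-speech tag
# and word length (number of characters).
words_to_convert = [
    {"content": tok.word, "pos": tok.flag, "length": len(tok.word)}
    for tok in pseg.lcut(sentence)
]

for w in words_to_convert:
    print(w)
```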
  • Step S12 Convert each word to be converted in the word group to be converted into a corresponding word vector, and store each word to be converted in association with the corresponding word vector in a preset word vector library.
  • The extraction device vectorizes each word to be converted in the phrase to be converted according to its content, part of speech, and word length, obtaining the word vector corresponding to that word.
  • the extraction device can use a machine learning model to convert words into word vectors, and the machine learning model can be a word2vec model and so on.
  • The extraction device may set a coding method in advance, through which the part of speech is encoded into a part-of-speech vector and the word length into a word length vector; the content vector, part-of-speech vector, and word length vector are then combined to obtain the word vector corresponding to the word, yielding the word vector sequence.
  • coding methods such as One-Hot coding or integer coding.
  • the method of combining the content vector, part-of-speech vector and word length vector may be direct splicing or indirect splicing through connection vectors. It can be understood that, in the stitching process, the stitching order of the content vector, part-of-speech vector, and word length vector is not limited.
  • Each word to be converted is associated with the corresponding word vector and stored in the preset word vector library.
  • the corresponding word vector can be found in the preset word vector library according to the word to be converted, or the corresponding word to be converted can be found according to the word vector.
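A minimal sketch of building the preset word vector library along the lines of step S12, assuming word2vec (via gensim) for the content vector, one-hot coding for the part of speech, and an integer-coded word length; the tag set, dimensions, and dictionary layout are illustrative assumptions rather than the patent's specification.

```python
import numpy as np
from gensim.models import Word2Vec  # assumed word2vec implementation

# Segmented corpus sentences, each a list of words to be converted.
corpus = [["我", "今天", "很", "高兴"], ["今天", "天气", "很", "好"]]

# Content vectors learned by a word2vec model.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1)

pos_tags = ["noun", "verb", "adjective", "adverb"]  # simplified part-of-speech set

def pos_vector(tag):
    """One-hot part-of-speech vector."""
    vec = np.zeros(len(pos_tags))
    vec[pos_tags.index(tag)] = 1.0
    return vec

def build_word_vector(word, tag):
    """Splice content vector, part-of-speech vector and word length into one word vector."""
    return np.concatenate([w2v.wv[word], pos_vector(tag), [len(word)]])

# Preset word vector library: each word to be converted stored with its word vector.
word_vector_library = {
    "今天": build_word_vector("今天", "noun"),
    "高兴": build_word_vector("高兴", "adjective"),
}
```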
  • In this embodiment, the corpus text to be converted is obtained and segmented to obtain the phrase to be converted; each word to be converted in the phrase is converted into a corresponding word vector, and each word to be converted is stored in association with its corresponding word vector in the preset word vector library. This provides the basis for subsequently converting the text to be extracted into vectorized data and using it as the input of the generation model, reducing the amount of model computation and improving the efficiency of text keyword extraction.
  • After the extraction device obtains the text to be extracted, the text to be extracted is segmented to obtain the words that make up the text, and each word is then vectorized to obtain its corresponding word vector, thereby obtaining the corresponding word vector group.
  • The word segmentation method for the text to be extracted is the same as that for the corpus text to be converted; the related word segmentation methods have been explained above and are not repeated here. The word vector corresponding to each word that constitutes the text to be extracted can be obtained by querying the preset word vector library, converting each word of the text to be extracted into its corresponding word vector; alternatively, the text to be extracted can be vectorized using the same word vector conversion method as the words to be converted, which is not repeated here.
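Continuing the sketch above, converting the text to be extracted into its word vector group then amounts to segmentation followed by a lookup in the preset word vector library; the helper names are assumptions.

```python
import jieba  # assumed segmenter, as before

def text_to_word_vector_group(text, word_vector_library):
    """Segment the text to be extracted and look up each word's vector in the library."""
    words = jieba.lcut(text)
    # Words missing from the library could instead be vectorized on the fly with
    # the same conversion method used for the words to be converted.
    return [word_vector_library[w] for w in words if w in word_vector_library]

# word_vector_group = text_to_word_vector_group("我今天很高兴", word_vector_library)
```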
  • Step S20 Extract target keyword vectors from the word vector group according to a preset optimal generation model
  • the preset optimal generative model refers to the generative model that contains the optimal model parameters, that is, the trained generative model.
  • The generative model and the discriminant model are combined to form a generative adversarial network, in which model training of the generative model and the discriminant model is carried out.
  • the problem to be solved by the generative adversarial network is how to learn new samples from the training samples.
  • a common application is to generate new pictures based on real pictures.
  • The generative model in this embodiment is a machine learning model that, after training, has the keyword vector extraction function;
  • the discriminant model is a machine learning model that, after training, has the discrimination function of distinguishing real keyword vectors from the predicted keyword vectors extracted by the generative model.
  • A machine learning model can acquire the aforementioned extraction or discrimination function through sample learning, and may be a neural network model, a support vector machine, a logistic regression model, etc.
  • The extraction device inputs the word vector group into the optimal generation model and uses the model parameters of the hidden layer of the optimal generation model to operate on the word vector group to obtain an operation result, that is, the target keyword vector extracted by the optimal generation model, where the operation on the word vectors using the hidden layer's model parameters may be a linear transformation, a nonlinear transformation, or a convolution transformation.
  • The extraction device may calculate each word vector in the word vector group in turn through the hidden layer of the optimal generation model, in the order of the word vectors in the group, cyclically taking the previous operation result and the current word vector as the input of the current operation, until the last operation is performed. It can be understood that since there is no previous operation in the first processing, the input of the first operation is the first word vector alone.
  • the word vector groups corresponding to the text to be extracted are X1, X2, X3, X4, and X5.
  • the hidden layer of the optimal generative model can operate on each word vector in the order of X1-X5 or X5-X1.
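The cyclic processing described above, where the previous operation result and the current word vector feed the next operation, reads like a recurrent hidden layer. The sketch below uses a simple tanh recurrence in numpy as one assumed interpretation; the patent only says the transformation may be linear, nonlinear, or convolutional, and fixes neither the architecture nor the dimensions.

```python
import numpy as np

def hidden_layer_pass(word_vector_group, W_h, W_x, b):
    """Process word vectors X1..Xn in order; each operation takes the previous
    result and the current word vector as input (recurrent-style)."""
    h = None
    for x in word_vector_group:
        if h is None:
            h = np.tanh(W_x @ x + b)             # first operation: first word vector only
        else:
            h = np.tanh(W_h @ h + W_x @ x + b)   # later operations: previous result + current vector
    return h  # final operation result, from which the target keyword vector is formed

# Example with random model parameters (dimensions are arbitrary assumptions).
rng = np.random.default_rng(0)
dim_in, dim_h = 55, 32
W_h = rng.normal(size=(dim_h, dim_h))
W_x = rng.normal(size=(dim_h, dim_in))
b = np.zeros(dim_h)
vectors = [rng.normal(size=dim_in) for _ in range(5)]  # X1..X5
result = hidden_layer_pass(vectors, W_h, W_x, b)
```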
  • Step S30 Convert the target keyword vector into the corresponding target keyword according to the preset word vector library, and extract the target keyword as the text keyword of the text to be extracted.
  • the target keyword vector is the keyword vector of the text to be extracted / predicted by the optimal generation model from the input word vector group, and the target keyword is the keyword of the text to be extracted / predicted by the optimal generation model.
  • the word vectors corresponding to each word constituting the text to be extracted are obtained, and each word of the text to be extracted is converted into a corresponding word vector.
  • The step of converting the target keyword vector into the corresponding target keyword specifically includes: querying the preset word vector library and, based on the association between the words stored in the library and their corresponding vectors, obtaining from the library the target keyword corresponding to the target keyword vector, thereby completing the conversion of the target keyword vector.
  • Alternatively, the text to be extracted is vectorized using the same word vector conversion method as the words to be converted, in which case the word vectors of both the words to be converted and the text to be extracted use the distributed representation (Distributed representation) method.
  • the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library specifically includes:
  • Step S31 traverse all the preset word vectors in the preset word vector library, and calculate the Euclidean distance between each preset word vector and the target keyword vector;
  • Under the distributed representation, related or similar words are mathematically close in vector distance; for example, the distance between "Mike" and "microphone" will be much smaller than the distance between "Mike" and "weather".
  • The principle of the distributed representation of word vectors is that, through training, each word in a specific text of a language is mapped into a fixed-length vector; all these vectors together form a word vector space, in which each vector is a point. By introducing "distance" into this space, the similarity (lexical and semantic) between words can be judged according to the distance between their word vectors.
  • In this embodiment, the distance between vectors is measured by the Euclidean distance, which indirectly measures the semantic similarity of the words corresponding to the vectors; that is, the word vectors of words with the same or similar semantics are close to each other.
  • Based on the Euclidean distance between the target keyword vector and each preset word vector, the one or more preset word vectors closest to the target keyword vector in the preset word vector library are determined, and the target keyword corresponding to the target keyword vector is then determined.
  • Euclidean distance refers to the arithmetic square root of the sum of the squared differences between two word vectors in each dimension, expressed by the formula D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² ), where D(X, Y) is the Euclidean distance between word vector X and word vector Y, n is the vector dimension, and xᵢ and yᵢ are the components of word vectors X and Y in dimension i.
  • Step S32: obtain, from all the preset word vectors, the matching word vector with the smallest Euclidean distance to the target keyword vector, and obtain from the preset word vector library the matching word corresponding to the matching word vector; the matching word is the target keyword.
  • The smaller the Euclidean distance, the closer the vectors; the preset word vector with the smallest Euclidean distance to the target keyword vector is therefore the word vector closest to it, and its corresponding word is the target keyword.
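Steps S31 and S32 can be sketched directly from the Euclidean distance formula above, reusing the dictionary-style library assumed earlier.

```python
import numpy as np

def euclidean_distance(x, y):
    """D(X, Y): square root of the sum of squared per-dimension differences."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def vector_to_keyword(target_keyword_vector, word_vector_library):
    """Traverse all preset word vectors and return the matching word whose
    vector has the smallest Euclidean distance to the target keyword vector."""
    return min(
        word_vector_library,
        key=lambda word: euclidean_distance(word_vector_library[word], target_keyword_vector),
    )

# target_keyword = vector_to_keyword(target_keyword_vector, word_vector_library)
```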
  • the text to be extracted is converted into a corresponding word vector group according to a preset word vector library;
  • The target keyword vector is extracted from the word vector group according to a preset optimal generation model; that is, by converting the text to be extracted into vectorized data and using it as the input of the generation model, the amount of model computation can be reduced and the efficiency of text keyword extraction improved;
  • the target keyword vector is converted into the corresponding target keyword, and the target keyword is extracted as the text keyword of the text to be extracted, realizing the extraction of the text keywords of the text to be extracted.
  • the method includes:
  • Step S21 Convert the preset training text into the corresponding training word vector group according to the preset word vector library, and obtain the real keyword vector in the training word vector group;
  • The preset training text is the preset training sample used to train the generative model and the discriminant model.
  • the extraction device can directly pull the training samples from the Internet, or obtain the training samples from the corpus.
  • the training text is segmented to obtain the training words that compose the training text, and then the training words are vectorized to obtain the corresponding training word vectors for each training word, thereby obtaining the corresponding training word vector group.
  • the word segmentation method of the training text is the same as the word segmentation method of the corpus text to be converted. The related word segmentation method has been explained above, and will not be repeated here.
  • the training word vector group is sample data actually input to the generation model and the discriminant model for training the model, and the training word vector group includes multiple training word vectors.
  • The real keywords of the training samples can be input by the user, and the extraction device vectorizes them to obtain the real keyword vectors; the real keywords can also be obtained from keyword tags when the training samples are crawled or obtained, and the extraction device then vectorizes these real keywords to obtain the real keyword vectors.
  • the generative model is used to extract keywords from the text, that is, the predicted text keywords, and the discriminant model is used to determine whether the output of the generative model is a real keyword.
  • Both the generative model and the discriminant model are neural network models.
  • the initial model parameters are randomly set and not optimized.
  • The two models are trained against each other: the generative model generates predicted text keywords for the discriminant model to judge, and the discriminant model judges whether the output of the generative model is the real keyword.
  • In this way, the parameters of both models are continuously optimized, their capabilities become stronger and stronger, and they finally reach a steady state.
  • Step S22: the training word vector group is input into the latest generation model, and the predicted keyword vector extracted from the training word vector group is output by the latest generation model;
  • the latest generation model refers to the generation model with the latest model parameters when the training word vector group is input this time;
  • the latest discriminant model refers to the discriminant model with the latest model parameters when the training word vector group is input this time.
  • the model parameters of the initial model are randomly set and not optimized, so the predicted keyword vector calculated by the internal neural network for the first time in the generated model is random.
  • the predicted keyword vector is one or more keyword vectors selected from the training word vector group by the internal model through internal calculation.
  • Step S23 Input the real keyword vector and the predicted keyword vector into the latest discriminant model, and output the matching probability of the predicted keyword vector and the real keyword vector from the latest discriminant model;
  • the training data of the discriminant model includes two types of input, one is the training word vector group corresponding to the training text and the actual keyword vector of the training text, and the other is the prediction keyword vector generated by the training text and the generation model.
  • the goal is to distinguish the real keyword vector from the predicted keyword vector.
  • The latest discriminant model calculates the matching probability of the predicted keyword vector and the real keyword vector. Specifically, step S23, in which the real keyword vector and the predicted keyword vector are input into the latest discriminant model and the matching probability of the predicted keyword vector and the real keyword vector is output by the latest discriminant model, includes:
  • Step S231 calculating the Euclidean distance between each predicted keyword vector and each real keyword vector separately;
  • Euclidean distance refers to the arithmetic square root of the sum of the squared differences between two word vectors in each dimension, expressed by the formula D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² ), where D(X, Y) is the Euclidean distance between word vector X and word vector Y, n is the vector dimension, and xᵢ and yᵢ are the components of word vectors X and Y in dimension i.
  • The Euclidean distance is used to represent the similarity of the words corresponding to the word vectors: the smaller the Euclidean distance, the closer the semantics of the words corresponding to the predicted keyword vector and the real keyword vector, and the better the predicted keyword vector matches the real keyword vector.
  • Step S232 Count the number of matching predicted word vectors whose Euclidean distance from a preset number of real keyword vectors is less than a preset value, and the preset number is at least one;
  • This embodiment is explained by taking a preset number of one as an example, that is, by counting the number of matching predicted word vectors whose Euclidean distance from any real keyword vector is less than the preset value.
  • the preset value can be obtained through the internal loss function and parameter optimization operation during the model training process, or it can be the initial preset value for the model.
  • the Euclidean distance is less than the preset value, which is the threshold condition for matching the predicted keyword vector with the real keyword vector.
  • The matching predicted word vector is a predicted keyword vector whose Euclidean distance from the preset number of real keyword vectors is less than the preset value; in this embodiment, the matching predicted word vector is considered to match the real keyword vector.
  • Step S233 Calculate the matching probability of the predicted keyword vector and the real keyword vector based on the number of matching predicted word vectors.
  • In one embodiment, the ratio of the number of matching predicted word vectors to the number of all predicted keyword vectors is the matching probability; in another embodiment, the ratio of the number of matching predicted word vectors to the number of all real keyword vectors is the matching probability.
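A minimal sketch of steps S231 to S233 under the first reading above (matching probability = matching predicted vectors / all predicted vectors), with the preset number taken as one as in the example; the preset value is an assumed placeholder.

```python
import numpy as np

def matching_probability(predicted_vectors, real_vectors, preset_value=1.0):
    """Count predicted keyword vectors whose Euclidean distance to at least one
    real keyword vector is below the preset value, and return the ratio of
    matching predicted vectors to all predicted vectors."""
    matches = 0
    for p in predicted_vectors:
        distances = [np.linalg.norm(np.asarray(p) - np.asarray(r)) for r in real_vectors]
        if min(distances) < preset_value:  # matches at least one real keyword vector
            matches += 1
    return matches / len(predicted_vectors)
```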
  • Step S24 if the matching probability is greater than a preset threshold, the latest generation model is a preset optimal generation model.
  • If the matching probability is greater than the preset threshold, it means that the model parameters of the latest generation model have reached their optimal values, and the latest generation model is the preset optimal generation model, which is used for subsequent keyword extraction from the text to be extracted.
  • After step S23, the method includes:
  • Step S25 if the matching probability is less than a preset threshold, calculating the respective loss functions of the latest generation model and the latest discriminant model according to the matching probability;
  • Step S26: the model parameters of the latest generation model and the latest discriminant model are optimized according to their respective loss functions, so as to obtain the latest generation model and the latest discriminant model after the model parameters are optimized and updated;
  • the loss function of the latest discriminant model is as follows:
  • y is the matching probability of the output of the generated model
  • G (z) is the output of the generated model
  • D (x) is the output of the discriminant model.
  • the above function optimizes the parameters of the neural network in the latest discriminant model.
  • the parameters of the generated model are updated.
  • the loss function of the generated model is as follows:
  • y is the matching probability of generating model output
  • G (z) is the output of generating model.
  • The generation model needs to generate predicted keyword vectors that are as realistic as possible, so that the discriminant model cannot discriminate them as false.
  • the generative model can generate predicted keyword vectors with a higher degree of confidence.
  • the parameters in the neural network of the generative model are optimized by the generative model's loss function.
  • the loss function is used to describe the model's generating ability or discriminating ability. The smaller the loss function, the higher the model's generating ability or discriminating ability.
  • The loss function is differentiated with respect to the parameters in the neural network, and the loss function is minimized in order to obtain better model parameters.
  • step S26 the step of optimizing the model parameters of the latest generation model and the latest discriminant model according to the loss functions of the latest generation model and the latest discriminant model includes:
  • Step S261 According to the respective loss functions of the latest generation model and the latest discriminant model, the ADAM algorithm is used to optimize the model parameters of the latest generation model and the latest discriminant model.
  • The ADAM algorithm (Adaptive Moment Estimation) is an adaptive moment estimation method. By calculating first-order and second-order moment estimates of the gradient, it designs an independent adaptive learning rate for each parameter, enabling iterative updates of the latest model parameters based on the training data.
  • the specific steps of the ADAM algorithm include:
  • Compared with other adaptive learning rate algorithms, the ADAM algorithm has a faster convergence speed and a more effective learning effect, and can correct problems found in other optimization techniques, such as a vanishing learning rate, slow convergence, or large fluctuations of the loss function caused by high-variance parameter updates.
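A compact sketch of one ADAM parameter update as described above, with first- and second-order moment estimates of the gradient and bias correction; the hyperparameter values are the commonly used defaults, not values taken from the patent.

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM step: update the biased moment estimates, correct their bias,
    and apply a per-parameter adaptive learning rate."""
    m = beta1 * m + (1 - beta1) * grad          # first-order moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-order moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# In practice a framework optimizer (for example torch.optim.Adam) would be applied
# to the loss functions of the latest generation model and the latest discriminant model.
```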
  • Step S27 the step of inputting the training word vector group into the latest generation model is performed.
  • The optimized generative model and discriminant model are used to perform the step of inputting the training word vector group into the latest generation model again, until the matching probability output by the discriminant model is greater than the preset threshold, at which point the iteration terminates.
  • the present application also provides a text keyword extraction device corresponding to each step of the above text keyword extraction method.
  • FIG. 4 is a schematic diagram of function modules of the first embodiment of the text keyword extraction device of the present application.
  • the device for extracting text keywords in this application includes:
  • the first vector conversion module 10 is configured to obtain text to be extracted, and convert the text to be extracted into a corresponding word vector group according to a preset word vector library;
  • the keyword generation module 20 is configured to extract the target keyword vector from the word vector group according to a preset optimal generation model
  • the second vector conversion module 30 is configured to convert the target keyword vector into the corresponding target keyword according to the preset word vector library, and extract the target keyword as the text keyword of the text to be extracted.
  • The second vector conversion module 30 is further configured to traverse all preset word vectors in the preset word vector library and calculate the Euclidean distance between each preset word vector and the target keyword vector; and to obtain, from all the preset word vectors, the matching word vector with the smallest Euclidean distance to the target keyword vector, and obtain from the preset word vector library the matching word corresponding to the matching word vector,
  • the matching word being the target keyword.
  • the text keyword extraction device includes:
  • The training module is configured to convert the preset training text into the corresponding training word vector group according to the preset word vector library and obtain the real keyword vector in the training word vector group; input the training word vector group into the latest generation model, and output, from the latest generation model, the predicted keyword vector extracted from the training word vector group; input the real keyword vector and the predicted keyword vector into the latest discriminant model, and output, from the latest discriminant model, the matching probability of the predicted keyword vector and the real keyword vector; and, if the matching probability is greater than a preset threshold, take the latest generation model as the preset optimal generation model.
  • The training module is further configured to, if the matching probability is less than the preset threshold, calculate the respective loss functions of the latest generation model and the latest discriminant model according to the matching probability;
  • optimize the respective model parameters of the latest generation model and the latest discriminant model according to their respective loss functions, so as to obtain the latest generation model and the latest discriminant model after the model parameters are optimized and updated; and perform the step of inputting the training word vector group into the latest generation model again.
  • the training module is further configured to optimize the model parameters of the latest generation model and the latest discriminant model by the ADAM algorithm according to the respective loss functions of the latest generation model and the latest discriminant model.
  • the text keyword extraction device further includes:
  • the word segmentation module is set to obtain the text of the corpus to be converted, segment the text of the corpus to obtain the phrase to be converted after the word segmentation;
  • the vector conversion module is configured to convert each word to be converted in the word group to be converted into a corresponding word vector, and store each word to be converted in association with the corresponding word vector in a preset word vector library.
  • The training module is further configured to calculate the Euclidean distance between each predicted keyword vector and each real keyword vector separately; count the number of matching predicted word vectors whose Euclidean distance from a preset number of real keyword vectors is less than a preset value, the preset number being at least one;
  • and calculate the matching probability of the predicted keyword vector and the real keyword vector based on the number of matching predicted word vectors.
  • the present application also proposes a computer-readable storage medium.
  • The computer-readable storage medium may be a non-volatile computer-readable storage medium on which a computer program is stored.
  • The computer-readable storage medium may be the memory 201 in the text keyword extraction device of FIG. 1, or may be at least one of, for example, a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, or an optical disk. The computer-readable storage medium includes several instructions that cause a device with a processor (which may be a mobile phone, a computer, a server, a network device, or the text keyword extraction device in an embodiment of the present application, etc.) to execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided by the present application are a method, apparatus and device for extracting text keywords on the basis of a classification model and a prediction model, as well as a computer-readable storage medium. The method comprises: obtaining a text to be extracted, and converting same into a corresponding word vector group according to a preset word vector library; extracting a target keyword vector from the word vector group according to a preset optimal generation model; and converting the target keyword vector into a corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword for the text to be extracted.

Description

Text keyword extraction method, apparatus, device and computer-readable storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on October 25, 2018, with application number 201811254895.6 and the invention title "Text Keyword Extraction Method, Apparatus, Device and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of keyword extraction technology, and in particular to a text keyword extraction method, apparatus, device, and computer-readable storage medium.
Background Art
Keyword extraction is widely used in many fields of text processing, such as text clustering, text summarization, and information retrieval. In the current era of big data, keyword extraction plays an important role in the NLP field, providing a cornerstone for hot topics such as sentiment analysis, semantic analysis, and knowledge graphs. At present, the mainstream methods in this field include keyword extraction based on latent topic models (LDA), keyword extraction based on TF-IDF word frequency statistics, and keyword extraction based on word graph models (TextRank).
This application proposes a new keyword extraction method.
Summary of the Invention
The main purpose of the present application is to provide a text keyword extraction method, which aims to solve the technical problem of low efficiency in existing text keyword extraction.
To achieve the above object, the present application provides a text keyword extraction method, wherein the text keyword extraction method includes the following steps:
obtaining the text to be extracted, and converting the text to be extracted into a corresponding word vector group according to a preset word vector library;
extracting a target keyword vector from the word vector group according to a preset optimal generation model;
converting the target keyword vector into a corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword of the text to be extracted.
In addition, to achieve the above object, the present application also provides a text keyword extraction apparatus, and the text keyword extraction apparatus includes:
a first vector conversion module, configured to obtain the text to be extracted and convert the text to be extracted into a corresponding word vector group according to a preset word vector library; a keyword generation module, configured to extract a target keyword vector from the word vector group according to a preset optimal generation model; and a second vector conversion module, configured to convert the target keyword vector into a corresponding target keyword according to the preset word vector library and extract the target keyword as a text keyword of the text to be extracted.
In addition, to achieve the above object, the present application also provides a text keyword extraction device. The text keyword extraction device includes a processor, a memory, and computer-readable instructions stored on the memory and executable by the processor, where the computer-readable instructions, when executed by the processor, implement the steps of the text keyword extraction method described above.
In addition, to achieve the above object, the present application also provides a computer-readable storage medium having computer-readable instructions stored thereon, where the computer-readable instructions, when executed by a processor, implement the steps of the text keyword extraction method described above.
In the embodiments of the present application, the text to be extracted is obtained and converted into a corresponding word vector group according to a preset word vector library; the target keyword vector is extracted from the word vector group according to a preset optimal generation model, that is, by converting the text to be extracted into vectorized data and using it as the input of the generation model, the amount of model computation can be reduced and the efficiency of text keyword extraction improved; according to the preset word vector library, the target keyword vector is converted into a corresponding target keyword, and the target keyword is extracted as the text keyword of the text to be extracted, realizing the extraction of the text keywords of the text to be extracted.
Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of a text keyword extraction device in the hardware operating environment involved in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a first embodiment of the text keyword extraction method of the present application;
FIG. 3 is a schematic flowchart of a second embodiment of the text keyword extraction method of the present application;
FIG. 4 is a schematic diagram of the functional modules of a first embodiment of the text keyword extraction apparatus of the present application.
The implementation, functional characteristics and advantages of the purpose of the present application will be further described in conjunction with the embodiments and with reference to the drawings.
Detailed Description
It should be understood that the specific embodiments described herein are only used to explain the present application and are not intended to limit the present application.
Please refer to FIG. 1, which is a schematic diagram of the hardware structure of the text keyword extraction device provided by this application.
The text keyword extraction device may be a PC, or a device with a display function such as a smart phone, tablet computer, portable computer or desktop computer. Optionally, the text keyword extraction device may be a server device on which a text keyword extraction back-end management system exists, through which users manage the text keyword extraction device.
The text keyword extraction device may include components such as a processor 101 and a memory 201. In the text keyword extraction device, the processor 101 is connected to the memory 201, and the memory 201 stores computer-readable instructions. The processor 101 can call the computer-readable instructions stored in the memory 201 and implement the steps of each embodiment of the text keyword extraction method described below.
The memory 201 can be used to store software programs and various data. The memory 201 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system and application programs (such as computer-readable instructions) required for at least one function, and the storage data area may include a database, such as node information of nodes in an associated network. In addition, the memory 201 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices.
The processor 101 is the control center of the text keyword extraction device and uses various interfaces and lines to connect the various parts of the entire text keyword extraction device. By running or executing the software programs and/or modules stored in the memory 201 and calling the data stored in the memory 201, it performs the various functions of the text keyword extraction device and processes data, thereby monitoring the text keyword extraction device as a whole. The processor 101 may include one or more processing units; optionally, the processor 101 may integrate an application processor and a modem processor, where the application processor mainly processes the operating system, the user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the foregoing modem processor may also not be integrated into the processor 101.
Those skilled in the art may understand that the structure of the text keyword extraction device shown in FIG. 1 does not constitute a limitation on the text keyword extraction device, which may include more or fewer components than illustrated, combine certain components, or use different component arrangements.
Based on the above hardware structure, various embodiments of the method of the present application are proposed. The "extraction device" in the following is an abbreviation of the text keyword extraction device.
This application provides a text keyword extraction method.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of a first embodiment of the text keyword extraction method of the present application.
In this embodiment, the text keyword extraction method includes the following steps:
Step S10: obtain the text to be extracted, and convert the text to be extracted into a corresponding word vector group according to a preset word vector library.
The text to be extracted refers to the text data from which keywords are to be extracted. The text to be extracted is a character string composed of multiple characters in a specific semantic order, and may be an article or a piece of text.
For acquiring the text to be extracted, specifically, the extraction device may provide an input interface to acquire, through the input interface, the text data from which a user needs to extract keywords, and use the acquired text data as the text to be extracted. The extraction device may also receive text data requiring keyword extraction sent by other devices, and use the received text data as the text to be extracted. The extraction device may also provide a selectable text list, to obtain the text to be extracted selected by the user from the selectable text list.
The preset word vector library stores preset corpus words and their corresponding word vectors. A word vector is a vector into which a word is mapped as real numbers. For example, the text form "microphone" may be expressed in the mathematical form "[0 0 0 1 0 0 0 0 0 0 0 ...]"; in that case, "[0 0 0 1 0 0 0 0 0 0 0 ...]" is the word vector of "microphone". It can be understood that there is no limitation on what representation the textual corpus words are converted into, as long as the corpus words can be expressed mathematically. Before the step in step S10 of converting the text to be extracted into the corresponding word vector group according to the preset word vector library, the preset word vector library needs to be established, which specifically includes:
Step S11: obtain the corpus text to be converted, and segment the corpus text to obtain the phrase to be converted after word segmentation.
The corpus text to be converted is the corpus text on which vector conversion is to be performed. The extraction device may pull the corpus text directly from the Internet, such as news or articles, or the corpus text may be obtained from a corpus.
The phrase to be converted refers to the group of words, obtained by segmenting the corpus text, that make up the corpus text. The "phrase" in this embodiment refers to multiple words, and the phrase to be converted includes multiple words to be converted.
Word segmentation is the operation of dividing a continuous character sequence into multiple individual characters or character sequences. The extraction device may first split the corpus text into several sentences according to punctuation marks, and then segment each sentence to obtain the words that make up the corpus text. The extraction device may use a preset word segmentation method to perform word segmentation and obtain multiple characters (unordered phrases) or character sequences (phrases with a specific arrangement order, such as phrases in the same order as the corpus text). The extraction device may then determine, according to a vocabulary, the part of speech of each word to be converted in the phrase to be converted obtained after word segmentation, and may also count the word length of each word, where part of speech is data that reflects the type of content of the word and includes 12 categories such as adjectives, prepositions, predicates, and nouns, and word length is the number of characters contained in the word. The preset word segmentation method may be a word segmentation method based on character matching, semantic understanding, or statistics. The extraction device may set a word length threshold for the words to be converted obtained by word segmentation, so that the word length of each word to be converted does not exceed the threshold.
For example, the extraction device determines the part of speech of each word in the word sequence "I / today / very / happy" and obtains "黎明 a / today b / very c / happy d", where a indicates a person's name, b indicates an adverbial, c indicates an adverb, and d indicates a predicate. The extraction device determines the word length of each word in the word sequence "I / today / very / happy" and obtains "I 1 / today 2 / very 1 / happy 2", where the number indicates the word length.
Step S12: converting each word to be converted in the word group to be converted into a corresponding word vector, and storing each word to be converted in association with its corresponding word vector in the preset word vector library.
The extraction device vectorizes each word to be converted according to its content, part of speech and word length to obtain the word vector corresponding to that word, thereby obtaining the word vector corresponding to each word to be converted. The extraction device may use a machine learning model, such as a word2vec model, to convert words into word vectors.
Specifically, the extraction device may set an encoding method in advance, encode the part of speech into a part-of-speech vector and the word length into a word-length vector through the encoding method, and then combine the content vector, the part-of-speech vector and the word-length vector to obtain the word vector corresponding to the word, so as to obtain a word vector sequence. The encoding method may be, for example, One-Hot encoding or integer encoding. The content vector, the part-of-speech vector and the word-length vector may be combined by direct concatenation or by indirect concatenation through a connection vector. It can be understood that the concatenation order of the content vector, the part-of-speech vector and the word-length vector is not limited.
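For illustration only, a minimal sketch of this composition step is given below, using plain NumPy; the content vector, tag set and dimensions are illustrative assumptions:

    import numpy as np

    POS_TAGS = ["noun", "verb", "adjective", "adverb"]   # illustrative tag set

    def pos_one_hot(pos):
        vec = np.zeros(len(POS_TAGS), dtype=np.float32)
        vec[POS_TAGS.index(pos)] = 1.0
        return vec

    def build_word_vector(content_vec, pos, length):
        """Concatenate content vector, one-hot POS vector and word-length vector."""
        length_vec = np.array([float(length)], dtype=np.float32)
        return np.concatenate([content_vec, pos_one_hot(pos), length_vec])

    # e.g. a 4-dimensional content vector produced by word2vec or one-hot encoding
    content = np.array([0.2, -0.1, 0.7, 0.05], dtype=np.float32)
    word_vec = build_word_vector(content, "adverb", 2)
    print(word_vec.shape)   # (9,) = 4 content dims + 4 POS dims + 1 length dim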
Each word to be converted is stored in association with its corresponding word vector in the preset word vector library, so that the corresponding word vector can be looked up in the preset word vector library according to the word to be converted, and the corresponding word to be converted can also be looked up according to the word vector.
In this embodiment, the corpus text to be converted is obtained and segmented to obtain the word group to be converted; each word to be converted in the word group is converted into a corresponding word vector, and each word to be converted is stored in association with its corresponding word vector in the preset word vector library. This provides a basis for subsequently converting the text to be extracted into vectorized data and using it as the input of the generation model, thereby reducing the amount of model computation and improving the efficiency of text keyword extraction.
After obtaining the text to be extracted, the extraction device segments the text to be extracted to obtain the words that make up the text, and then vectorizes each word to obtain the word vector corresponding to each word, thereby obtaining the corresponding word vector group. The word segmentation method for the text to be extracted is the same as that for the corpus text to be converted; the relevant segmentation methods have been explained above and are not repeated here. The word vector corresponding to each word of the text to be extracted may be obtained by querying the preset word vector library, so that each word of the text to be extracted is converted into the corresponding word vector; the text to be extracted may also be vectorized in the same word vector conversion manner as the words to be converted, which is not repeated here.
Step S20: extracting a target keyword vector from the word vector group according to a preset optimal generation model;
The preset optimal generation model refers to a generation model containing optimal model parameters, that is, a trained generation model. In the training stage of the generation model, the generation model and a discriminant model together form a generative adversarial network, and the model training of the generation model and the discriminant model is carried out in the generative adversarial network. The problem a generative adversarial network solves is how to learn new samples from training samples; a common application is generating new pictures from real pictures.
The generation model in this embodiment is a machine learning model that, after training, has the function of extracting keyword vectors, and the discriminant model is a machine learning model that, after training, has the discriminant function of distinguishing real keyword vectors from the predicted keyword vectors extracted by the generation model. A machine learning model can acquire the aforementioned extraction function or discriminant function through sample learning, and may be a neural network model, a support vector machine, a logistic regression model, or the like.
In this embodiment, the extraction device inputs the word vector group into the optimal generation model and operates on the word vector group using the model parameters of the hidden layer of the optimal generation model to obtain an operation result, that is, the target keyword vector extracted by the optimal generation model. The operations performed on the word vectors using the model parameters of the hidden layer may be linear transformations, nonlinear transformations, convolution transformations, or the like.
In one implementation, the extraction device may operate on the word vectors in the word vector group one by one through the hidden layer of the optimal generation model according to their order, cyclically taking the previous operation result together with the current word vector as the input of the current operation, until the last operation is completed. It can be understood that, since there is no previous operation for the first word vector, the input of the first operation is the first word vector alone. For example, suppose the word vector group corresponding to the text to be extracted is X1, X2, X3, X4, X5. The hidden layer of the optimal generation model may operate on the word vectors in the order X1 to X5 or X5 to X1. For instance, X1 is first taken as input to obtain an operation result Y1, then Y1 and X2 are taken as input to obtain an operation result Y2, then Y2 and X3 are taken as input to obtain an operation result Y3, and so on, until the operation result Y5 corresponding to the last word vector X5 is obtained.
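For illustration only, a minimal sketch of such a sequential hidden-layer pass over the word vector group is given below, assuming a single recurrent hidden layer with a tanh nonlinearity; the layer size, weights and nonlinearity are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, hidden = 9, 16                                   # illustrative sizes
    W_x = rng.normal(scale=0.1, size=(hidden, dim))       # input weights
    W_y = rng.normal(scale=0.1, size=(hidden, hidden))    # recurrent weights

    def hidden_pass(word_vectors):
        """Run the hidden layer over X1..Xn, feeding each result back with the next vector."""
        y = np.zeros(hidden)
        outputs = []
        for x in word_vectors:                # X1, X2, ..., Xn in order
            y = np.tanh(W_x @ x + W_y @ y)    # current word vector plus previous result
            outputs.append(y)
        return outputs                        # Y1, Y2, ..., Yn

    word_vector_group = [rng.normal(size=dim) for _ in range(5)]   # X1..X5
    Y = hidden_pass(word_vector_group)
    print(len(Y), Y[-1].shape)                # 5 (16,), Y5 is the final result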
Step S30: converting the target keyword vector into the corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword of the text to be extracted.
The target keyword vector is the keyword vector of the text to be extracted that is extracted or predicted by the optimal generation model from the input word vector group, and the target keyword is the keyword of the text to be extracted that is extracted or predicted by the optimal generation model.
In one implementation, the word vector corresponding to each word of the text to be extracted is obtained by querying the preset word vector library, and each word of the text to be extracted is converted into the corresponding word vector. In this case, the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library specifically includes: querying the preset word vector library, and, based on the association between words and their corresponding vectors stored in the preset word vector library, obtaining the target keyword corresponding to the target keyword vector from the preset word vector library, thereby completing the conversion of the target keyword vector.
In another implementation, the text to be extracted is vectorized in the same word vector conversion manner as the words to be converted, where the word vectors of the words to be converted and of the converted text use a distributed representation. In this case, the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library specifically includes:
Step S31: traversing all preset word vectors in the preset word vector library, and calculating the Euclidean distance between each preset word vector and the target keyword vector;
Word vectors in a distributed representation have the property that related or similar words are mathematically close in vector distance. For example, the distance between "麦克" ("Mike") and "话筒" ("microphone") will be far smaller than the distance between "麦克" ("Mike") and "天气" ("weather").
The principle of the distributed representation of word vectors is as follows: through training, each word in a specific text of a language is mapped into a fixed-length vector, and all these vectors together form a word vector space, with each vector being a point in that space. By introducing a "distance" on this space, the (lexical and semantic) similarity between words can be judged according to the distance between their word vectors.
In this implementation, the Euclidean distance is used to measure the distance between vectors and thereby indirectly measure the semantic similarity of the corresponding words, that is, the word vectors of words with the same or similar semantics are close to each other. By separately calculating the Euclidean distance between the target keyword vector and each preset word vector, one or more preset word vectors in the preset word vector library that are closest to the target keyword vector are determined, and the target keyword corresponding to the target keyword vector is thereby determined.
The Euclidean distance is the arithmetic square root of the sum of the squared differences of the word vectors in each dimension, expressed by the formula:
D(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² )
where D(X, Y) is the Euclidean distance between word vector X and word vector Y, n is the vector dimension, and x_i and y_i are the components of X and Y in each dimension.
Step S32: obtaining, from all the preset word vectors, the matching word vector having the smallest Euclidean distance to the target keyword vector, and obtaining the matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
The smaller the Euclidean distance, the closer the vectors are. The preset word vector having the smallest Euclidean distance to the target keyword vector is the word vector closest to the target keyword vector, and its corresponding word is the target keyword.
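For illustration only, a minimal sketch of steps S31 and S32 over an in-memory library is given below, using NumPy; the library contents and target vector are illustrative assumptions:

    import numpy as np

    def euclidean(x, y):
        """D(X, Y) = sqrt(sum_i (x_i - y_i)^2)."""
        return float(np.sqrt(np.sum((x - y) ** 2)))

    def nearest_word(target_vec, word_vector_library):
        """Traverse the preset library and return the word whose vector is closest."""
        best_word, best_dist = None, float("inf")
        for word, vec in word_vector_library.items():
            d = euclidean(target_vec, vec)
            if d < best_dist:
                best_word, best_dist = word, d
        return best_word, best_dist

    # illustrative library of distributed (dense) word vectors
    library = {"麦克": np.array([0.9, 0.1]),
               "话筒": np.array([0.85, 0.15]),
               "天气": np.array([-0.7, 0.6])}
    target_keyword_vector = np.array([0.86, 0.14])
    print(nearest_word(target_keyword_vector, library))   # ('话筒', ...)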
In this embodiment, the text to be extracted is obtained and converted into a corresponding word vector group according to the preset word vector library; the target keyword vector is extracted from the word vector group according to the preset optimal generation model, that is, by converting the text to be extracted into vectorized data and using it as the input of the generation model, the amount of model computation can be reduced and the efficiency of text keyword extraction improved; and according to the preset word vector library, the target keyword vector is converted into the corresponding target keyword, and the target keyword is extracted as a text keyword of the text to be extracted, so that extraction of the text keywords of the text to be extracted is achieved.
Further, as shown in FIG. 3, in a second embodiment of the text keyword extraction method of the present application, before step S20 the method includes:
Step S21: converting preset training text into a corresponding training word vector group according to the preset word vector library, and obtaining the real keyword vectors in the training word vector group;
The preset training text is the preset training sample used for training the generation model and the discriminant model. The extraction device may pull training samples directly from the Internet, or obtain training samples from a corpus. After obtaining the training text, the extraction device segments the training text to obtain the training words that make up the training text, and then vectorizes each training word to obtain the training word vector corresponding to each training word, thereby obtaining the corresponding training word vector group. The word segmentation method for the training text is the same as that for the corpus text to be converted; the relevant segmentation methods have been explained above and are not repeated here.
The training word vector corresponding to each training word that makes up the training text may be obtained by querying the preset word vector library, so that each training word of the training text is converted into the corresponding training word vector to obtain the training word vector group; the training text may also be vectorized in the same word vector conversion manner as the words to be converted, which is not repeated here.
The training word vector group is the sample data actually input into the generation model and the discriminant model for model training, and the training word vector group includes multiple training word vectors.
The real keywords of the training samples may be input by a user, and the extraction device vectorizes the real keywords to obtain the real keyword vectors; keyword tags may also be obtained when the training samples are crawled or acquired and used as the real keywords of the training samples, which the extraction device then vectorizes to obtain the real keyword vectors.
The generation model is used to extract keywords from text, that is, to predict text keywords, and the discriminant model is used to judge whether the output of the generation model is a real keyword. Both the generation model and the discriminant model are neural network models whose initial model parameters are set randomly and have not yet been optimized. Subsequently, the two models are trained adversarially: the generation model produces predicted text keywords for the discriminant model to discriminate, and the discriminant model judges whether the output of the generation model is a real keyword. During the training of these two models, the model parameters are continuously optimized, the capabilities of both models become stronger and stronger, and a steady state is finally reached.
Step S22: inputting the training word vector group into the latest generation model, and extracting predicted keyword vectors from the training word vector group as the output of the latest generation model;
During model training, the model parameters are continuously optimized and updated. The latest generation model refers to the generation model having the latest model parameters when the training word vectors are input this time, and the latest discriminant model refers to the discriminant model having the latest model parameters when the training word vectors are input this time.
The model parameters of the initialized models are set randomly and have not yet been optimized, so the predicted keyword vectors calculated by the internal neural network of the generation model the first time are random. A predicted keyword vector is one or more keyword vectors selected by the generation model from the training word vector group through internal operations.
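For illustration only, a minimal sketch of such a selection step is given below, in which the generation model scores each training word vector with a small randomly initialized network and keeps the top-scoring vectors as predicted keyword vectors; the scoring network, its size and the number of keywords kept are illustrative assumptions, not the claimed architecture:

    import numpy as np

    rng = np.random.default_rng(1)
    dim = 9
    W1 = rng.normal(scale=0.1, size=(16, dim))
    W2 = rng.normal(scale=0.1, size=(1, 16))

    def generator_select(word_vectors, k=2):
        """Score each word vector and keep the top k as predicted keyword vectors."""
        scores = [float(W2 @ np.tanh(W1 @ x)) for x in word_vectors]
        top = np.argsort(scores)[::-1][:k]
        return [word_vectors[i] for i in top]

    training_word_vector_group = [rng.normal(size=dim) for _ in range(6)]
    predicted = generator_select(training_word_vector_group)
    print(len(predicted))   # 2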
Step S23: inputting the real keyword vectors and the predicted keyword vectors into the latest discriminant model, and outputting, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors;
The training data of the discriminant model includes two types of input: one is the training word vector group corresponding to the training text together with the real keyword vectors of the training text, and the other is the training text together with the predicted keyword vectors generated by the generation model. The goal of the discriminant model is to distinguish the real keyword vectors from the predicted keyword vectors.
The latest discriminant model calculates the matching probability between the predicted keyword vectors and the real keyword vectors. Specifically, in step S23, between inputting the real keyword vectors and the predicted keyword vectors into the latest discriminant model and outputting, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors, the method includes:
Step S231: separately calculating the Euclidean distance between each predicted keyword vector and each real keyword vector;
The Euclidean distance is the arithmetic square root of the sum of the squared differences of the word vectors in each dimension, expressed by the formula:
D(X, Y) = √( Σ_{i=1}^{n} (x_i − y_i)² )
where D(X, Y) is the Euclidean distance between word vector X and word vector Y, n is the vector dimension, and x_i and y_i are the components of X and Y in each dimension.
The Euclidean distance is used to characterize the similarity of the words corresponding to the word vectors. The smaller the Euclidean distance, the closer the semantics of the words corresponding to the predicted keyword vector and the real keyword vector, and the better the predicted keyword vector matches the real keyword vector.
Step S232: counting the number of matching predicted word vectors whose Euclidean distance to a preset number of real keyword vectors is less than a preset value, the preset number being at least one;
This embodiment is explained by taking the preset number being one as an example, that is, counting the number of matching predicted word vectors whose Euclidean distance to any real keyword vector is less than the preset value. For a text to be extracted or a training text, there may be multiple real keyword vectors and multiple predicted keyword vectors; as long as a predicted keyword vector matches any one real keyword vector, it is regarded as matching the real keyword vectors.
The preset value may be obtained through the internal loss function and parameter optimization operations during model training, or it may be an initially preset value of the model.
The Euclidean distance being less than the preset value is the threshold condition for a predicted keyword vector to match a real keyword vector. A matching predicted word vector is a predicted keyword vector whose Euclidean distance to the preset number of real keyword vectors is less than the preset value; in this embodiment, a matching predicted word vector matches the real keyword vectors.
Step S233: calculating the matching probability between the predicted keyword vectors and the real keyword vectors based on the number of matching predicted word vectors.
In one implementation, the matching probability is the ratio of the number of matching predicted word vectors to the number of all predicted keyword vectors; in another implementation, the matching probability is the ratio of the number of matching predicted word vectors to the number of all real keyword vectors.
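For illustration only, a minimal sketch of steps S231 to S233 for the first implementation (matching probability = matches / number of predicted keyword vectors) is given below; the preset value and the sample vectors are illustrative assumptions:

    import numpy as np

    def matching_probability(predicted, real, preset_value=0.5):
        """Count predicted keyword vectors within `preset_value` of any real keyword vector."""
        matches = 0
        for p in predicted:
            dists = [np.sqrt(np.sum((p - r) ** 2)) for r in real]
            if min(dists) < preset_value:       # matches at least one real keyword vector
                matches += 1
        return matches / len(predicted)         # ratio over all predicted keyword vectors

    predicted = [np.array([0.1, 0.2]), np.array([2.0, 2.0])]
    real = [np.array([0.0, 0.2]), np.array([1.0, 1.0])]
    print(matching_probability(predicted, real))   # 0.5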
Step S24: if the matching probability is greater than a preset threshold, taking the latest generation model as the preset optimal generation model.
If the matching probability is greater than the preset threshold, it means that the model parameters of the latest generation model have reached their optimal values; the latest generation model is then the preset optimal generation model and is used for subsequent keyword extraction from the text to be extracted.
Further, after step S23 the method includes:
Step S25: if the matching probability is less than the preset threshold, calculating the respective loss functions of the latest generation model and the latest discriminant model according to the matching probability;
Step S26: optimizing the respective model parameters of the latest generation model and the latest discriminant model according to their respective loss functions, so as to obtain a latest generation model and a latest discriminant model with optimized and updated model parameters;
The loss function of the latest discriminant model is as follows:
-((1-y)log(1-D(G(z)))) - ylog(D(x))
where y is the matching probability output by the generation model, G(z) is the output of the generation model, and D(x) is the output of the discriminant model.
The meaning of the loss function of the latest discriminant model is as follows: the predicted keyword vectors that match the real keyword vectors are labeled y = 1 as far as possible, and the predicted keyword vectors that do not match the real keyword vectors are labeled y = 0, and the parameters of the neural network in the latest discriminant model are optimized through the above function.
After the parameters of the discriminant model have been updated, the parameters of the generation model are updated.
The loss function of the generation model is as follows:
(1-y)log(1-D(G(z)))
where y is the matching probability output by the generation model, and G(z) is the output of the generation model.
The generation model needs to generate predicted keyword vectors that the discriminant model cannot judge as fake. In this case, the generation model is able to generate predicted keyword vectors with higher credibility. After the loss function of the generation model is obtained, the parameters in the neural network of the generation model are optimized through the loss function of the generation model. The loss function describes the generation ability or the discrimination ability of a model; the smaller the loss function, the higher the generation ability or discrimination ability of the model. The parameters of the neural network are differentiated through the loss function so as to minimize it, thereby obtaining better model parameters.
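For illustration only, a minimal sketch of the two loss values as written above is given below, computed with NumPy; the scalar inputs are illustrative, and a small epsilon is added inside the logarithms purely for numerical stability:

    import numpy as np

    EPS = 1e-8   # numerical stability only; not part of the formulas in the text

    def discriminant_loss(y, d_gz, d_x):
        """-((1 - y) * log(1 - D(G(z)))) - y * log(D(x))"""
        return -((1 - y) * np.log(1 - d_gz + EPS)) - y * np.log(d_x + EPS)

    def generation_loss(y, d_gz):
        """(1 - y) * log(1 - D(G(z)))"""
        return (1 - y) * np.log(1 - d_gz + EPS)

    # y: label / matching probability, d_gz = D(G(z)), d_x = D(x); illustrative values
    print(discriminant_loss(y=1.0, d_gz=0.3, d_x=0.8))   # small when D(x) is close to 1
    print(generation_loss(y=0.0, d_gz=0.9))              # very negative when D(G(z)) is close to 1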
In step S26, the step of optimizing the respective model parameters of the latest generation model and the latest discriminant model according to their respective loss functions includes:
Step S261: optimizing the respective model parameters of the latest generation model and the latest discriminant model through the ADAM algorithm according to their respective loss functions.
The ADAM algorithm (Adaptive Moment Estimation) is an adaptive moment estimation method that designs independent adaptive learning rates for different parameters by calculating first-order and second-order moment estimates of the gradients, and can iteratively update the weights of the neural networks in the latest generation model and the latest discriminant model based on the training data. The specific steps of the ADAM algorithm include:
determining a (the learning rate), β1 (the exponential decay rate for the first-order moment estimates, for example 0.9), β2 (the exponential decay rate for the second-order moment estimates, for example 0.999) and the stochastic objective function over the parameter e (in this embodiment, the latest generation model and the latest discriminant model); after the parameters a, β1, β2 and the stochastic objective function are determined, initializing the parameter vector, the first-order moment vector, the second-order moment vector and the time step; and, while the parameter e has not converged, iteratively updating each part in a loop: incrementing the time step t by 1, updating the gradient of the objective function with respect to the parameter e at this time step, updating the biased first-order moment estimate and the biased second-order raw moment estimate, calculating the bias-corrected first-order moment estimate and the bias-corrected second-order moment estimate, and then updating the parameter e of the latest generation model and the latest discriminant model with the values calculated above.
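For illustration only, a minimal sketch of the ADAM update loop following the steps above is given below; the gradient function, hyperparameter values and toy objective are illustrative assumptions:

    import numpy as np

    def adam_update(e, grad_fn, steps=1000, a=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        """Iteratively update parameter vector `e` using bias-corrected moment estimates."""
        m = np.zeros_like(e)   # first-order moment vector
        v = np.zeros_like(e)   # second-order moment vector
        for t in range(1, steps + 1):            # time step
            g = grad_fn(e)                       # gradient of the objective w.r.t. e
            m = beta1 * m + (1 - beta1) * g      # biased first-order moment estimate
            v = beta2 * v + (1 - beta2) * g**2   # biased second-order raw moment estimate
            m_hat = m / (1 - beta1**t)           # bias-corrected first-order moment
            v_hat = v / (1 - beta2**t)           # bias-corrected second-order moment
            e = e - a * m_hat / (np.sqrt(v_hat) + eps)
        return e

    # illustrative objective: minimize ||e||^2, whose gradient is 2e
    e = adam_update(np.array([1.0, -2.0]), grad_fn=lambda e: 2 * e)
    print(e)   # approaches [0, 0]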
Compared with other adaptive learning rate algorithms, the ADAM algorithm converges faster and learns more effectively, and can correct problems existing in other optimization techniques, such as the learning rate vanishing, convergence being too slow, or high-variance parameter updates causing the loss function to fluctuate greatly.
Step S27: performing the step of inputting the training word vector group into the latest generation model.
After the generation model and the discriminant model have been optimized, the optimized generation model and discriminant model are used to perform the step of inputting the training word vector group into the latest generation model, until the matching probability output by the discriminant model is greater than the preset threshold, at which point the iteration terminates.
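Putting steps S21 to S27 together, a compact sketch of the adversarial training loop is given below; it reuses the illustrative helper functions sketched above (generator_select, matching_probability, discriminant_loss, generation_loss, adam_update), all of which are assumptions for illustration rather than the claimed implementation:

    def train(training_word_vector_group, real_keyword_vectors,
              preset_threshold=0.9, max_rounds=1000):
        for _ in range(max_rounds):
            # Step S22: the latest generation model predicts keyword vectors
            predicted = generator_select(training_word_vector_group)
            # Step S23: the latest discriminant model outputs the matching probability
            p = matching_probability(predicted, real_keyword_vectors)
            # Step S24: stop once the matching probability exceeds the preset threshold
            if p > preset_threshold:
                return "optimal generation model reached"
            # Steps S25-S26: compute the two losses and update the discriminant model first,
            # then the generation model; parameter updates would use adam_update().
            # ... loss computation and ADAM parameter updates omitted in this sketch ...
        return "max rounds reached"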
In addition, the present application further provides a text keyword extraction device corresponding to the steps of the above text keyword extraction method.
Referring to FIG. 4, FIG. 4 is a schematic diagram of the functional modules of the first embodiment of the text keyword extraction device of the present application.
In this embodiment, the text keyword extraction device of the present application includes:
a first vector conversion module 10, configured to obtain the text to be extracted and convert the text to be extracted into a corresponding word vector group according to the preset word vector library;
a keyword generation module 20, configured to extract a target keyword vector from the word vector group according to the preset optimal generation model; and
a second vector conversion module 30, configured to convert the target keyword vector into the corresponding target keyword according to the preset word vector library, and extract the target keyword as a text keyword of the text to be extracted.
Further, the second vector conversion module 30 is further configured to traverse all preset word vectors in the preset word vector library and calculate the Euclidean distance between each preset word vector and the target keyword vector; and to obtain, from all the preset word vectors, the matching word vector having the smallest Euclidean distance to the target keyword vector, and obtain the matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
Further, the text keyword extraction device includes:
a training module, configured to convert preset training text into a corresponding training word vector group according to the preset word vector library and obtain the real keyword vectors in the training word vector group; input the training word vector group into the latest generation model, and extract predicted keyword vectors from the training word vector group as the output of the latest generation model; input the real keyword vectors and the predicted keyword vectors into the latest discriminant model, and output, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors; and, if the matching probability is greater than a preset threshold, take the latest generation model as the preset optimal generation model.
Further, the training module is further configured to: if the matching probability is less than the preset threshold, calculate the respective loss functions of the latest generation model and the latest discriminant model according to the matching probability; optimize the respective model parameters of the latest generation model and the latest discriminant model according to their respective loss functions, so as to obtain a latest generation model and a latest discriminant model with optimized and updated model parameters; and perform the step of inputting the training word vector group into the latest generation model.
Further, the training module is further configured to optimize the respective model parameters of the latest generation model and the latest discriminant model through the ADAM algorithm according to their respective loss functions.
Further, the text keyword extraction device further includes:
a word segmentation module, configured to obtain the corpus text to be converted and perform word segmentation on the corpus text to obtain the word group to be converted after segmentation; and
a vector conversion module, configured to convert each word to be converted in the word group to be converted into a corresponding word vector, and store each word to be converted in association with its corresponding word vector in the preset word vector library.
Further, the training module is further configured to separately calculate the Euclidean distance between each predicted keyword vector and each real keyword vector; count the number of matching predicted word vectors whose Euclidean distance to a preset number of real keyword vectors is less than a preset value, the preset number being at least one; and calculate the matching probability between the predicted keyword vectors and the real keyword vectors based on the number of matching predicted word vectors.
The present application further proposes a computer-readable storage medium, which may be a non-volatile computer-readable storage medium, on which a computer program is stored. The computer-readable storage medium may be the memory 201 in the text keyword extraction device of FIG. 1, or may be at least one of a ROM (Read-Only Memory)/RAM (Random Access Memory), a magnetic disk, and an optical disk. The computer-readable storage medium includes several instructions for causing a device having a processor (which may be a mobile phone, a computer, a server, a network device, or the text keyword extraction device in the embodiments of the present application) to execute the methods described in the embodiments of the present application.
It should be noted that, in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or server that includes a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article or server. Without further limitation, an element defined by the statement "including a ..." does not exclude the existence of other identical elements in the process, method, article or server that includes that element.
The serial numbers of the above embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented by hardware, but in many cases the former is the better implementation.
The above are only preferred embodiments of the present application and are not intended to limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, shall likewise be included in the patent protection scope of the present application.

Claims (20)

  1. A text keyword extraction method, wherein the text keyword extraction method comprises the following steps:
    obtaining a text to be extracted, and converting the text to be extracted into a corresponding word vector group according to a preset word vector library;
    extracting a target keyword vector from the word vector group according to a preset optimal generation model;
    converting the target keyword vector into a corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword of the text to be extracted.
  2. The text keyword extraction method according to claim 1, wherein the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library comprises:
    traversing all preset word vectors in the preset word vector library, and calculating the Euclidean distance between each preset word vector and the target keyword vector;
    obtaining, from all the preset word vectors, a matching word vector having the smallest Euclidean distance to the target keyword vector, and obtaining a matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
  3. The text keyword extraction method according to claim 1, wherein before the step of extracting the target keyword vector from the word vector group according to the preset optimal generation model, the method comprises:
    converting preset training text into a corresponding training word vector group according to the preset word vector library, and obtaining real keyword vectors in the training word vector group;
    inputting the training word vector group into a latest generation model, and extracting predicted keyword vectors from the training word vector group as the output of the latest generation model;
    inputting the real keyword vectors and the predicted keyword vectors into a latest discriminant model, and outputting, by the latest discriminant model, a matching probability between the predicted keyword vectors and the real keyword vectors;
    if the matching probability is greater than a preset threshold, taking the latest generation model as the preset optimal generation model.
  4. The text keyword extraction method according to claim 3, wherein after the step of outputting, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors, the method comprises:
    if the matching probability is less than the preset threshold, calculating respective loss functions of the latest generation model and the latest discriminant model according to the matching probability;
    optimizing respective model parameters of the latest generation model and the latest discriminant model according to the respective loss functions of the latest generation model and the latest discriminant model, so as to obtain a latest generation model and a latest discriminant model with optimized and updated model parameters;
    performing the step of inputting the training word vector group into the latest generation model.
  5. The text keyword extraction method according to claim 4, wherein the step of optimizing the respective model parameters of the latest generation model and the latest discriminant model according to their respective loss functions comprises:
    optimizing the respective model parameters of the latest generation model and the latest discriminant model through the ADAM algorithm according to the respective loss functions of the latest generation model and the latest discriminant model.
  6. The text keyword extraction method according to claim 1, wherein before the step of converting the text to be extracted into the corresponding word vector group according to the preset word vector library, the method comprises:
    obtaining corpus text to be converted, and performing word segmentation on the corpus text to obtain a word group to be converted after segmentation;
    converting each word to be converted in the word group to be converted into a corresponding word vector, and storing each word to be converted in association with its corresponding word vector in the preset word vector library.
  7. The text keyword extraction method according to claim 3, wherein between inputting the real keyword vectors and the predicted keyword vectors into the latest discriminant model and outputting, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors, the method comprises:
    separately calculating the Euclidean distance between each predicted keyword vector and each real keyword vector;
    counting the number of matching predicted word vectors whose Euclidean distance to a preset number of real keyword vectors is less than a preset value, the preset number being at least one;
    calculating the matching probability between the predicted keyword vectors and the real keyword vectors based on the number of matching predicted word vectors.
  8. A text keyword extraction device, wherein the text keyword extraction device comprises:
    a first vector conversion module, configured to obtain a text to be extracted and convert the text to be extracted into a corresponding word vector group according to a preset word vector library;
    a keyword generation module, configured to extract a target keyword vector from the word vector group according to a preset optimal generation model;
    a second vector conversion module, configured to convert the target keyword vector into a corresponding target keyword according to the preset word vector library, and extract the target keyword as a text keyword of the text to be extracted.
  9. The text keyword extraction device according to claim 8, wherein the second vector conversion module is further configured to traverse all preset word vectors in the preset word vector library and calculate the Euclidean distance between each preset word vector and the target keyword vector; and to obtain, from all the preset word vectors, a matching word vector having the smallest Euclidean distance to the target keyword vector, and obtain a matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
  10. The text keyword extraction device according to claim 8, wherein the text keyword extraction device comprises:
    a training module, configured to convert preset training text into a corresponding training word vector group according to the preset word vector library and obtain real keyword vectors in the training word vector group; input the training word vector group into a latest generation model, and extract predicted keyword vectors from the training word vector group as the output of the latest generation model; input the real keyword vectors and the predicted keyword vectors into a latest discriminant model, and output, by the latest discriminant model, a matching probability between the predicted keyword vectors and the real keyword vectors; and, if the matching probability is greater than a preset threshold, take the latest generation model as the preset optimal generation model.
  11. The text keyword extraction device according to claim 10, wherein the training module is further configured to: if the matching probability is less than the preset threshold, calculate respective loss functions of the latest generation model and the latest discriminant model according to the matching probability; optimize respective model parameters of the latest generation model and the latest discriminant model according to the respective loss functions of the latest generation model and the latest discriminant model, so as to obtain a latest generation model and a latest discriminant model with optimized and updated model parameters; and perform the step of inputting the training word vector group into the latest generation model.
  12. The text keyword extraction device according to claim 8, wherein the text keyword extraction device comprises:
    a word segmentation module, configured to obtain corpus text to be converted and perform word segmentation on the corpus text to obtain a word group to be converted after segmentation;
    a vector conversion module, configured to convert each word to be converted in the word group to be converted into a corresponding word vector, and store each word to be converted in association with its corresponding word vector in the preset word vector library.
  13. A text keyword extraction device, wherein the text keyword extraction device comprises a processor, a memory, and computer-readable instructions stored on the memory and executable by the processor, wherein the computer-readable instructions, when executed by the processor, implement the following steps:
    obtaining a text to be extracted, and converting the text to be extracted into a corresponding word vector group according to a preset word vector library;
    extracting a target keyword vector from the word vector group according to a preset optimal generation model;
    converting the target keyword vector into a corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword of the text to be extracted.
  14. The text keyword extraction device according to claim 13, wherein the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library comprises:
    traversing all preset word vectors in the preset word vector library, and calculating the Euclidean distance between each preset word vector and the target keyword vector;
    obtaining, from all the preset word vectors, a matching word vector having the smallest Euclidean distance to the target keyword vector, and obtaining a matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
  15. The text keyword extraction device according to claim 13, wherein before the step of extracting the target keyword vector from the word vector group according to the preset optimal generation model, the following is included:
    converting preset training text into a corresponding training word vector group according to the preset word vector library, and obtaining real keyword vectors in the training word vector group;
    inputting the training word vector group into a latest generation model, and extracting predicted keyword vectors from the training word vector group as the output of the latest generation model;
    inputting the real keyword vectors and the predicted keyword vectors into a latest discriminant model, and outputting, by the latest discriminant model, a matching probability between the predicted keyword vectors and the real keyword vectors;
    if the matching probability is greater than a preset threshold, taking the latest generation model as the preset optimal generation model.
  16. The text keyword extraction device according to claim 15, wherein after the step of outputting, by the latest discriminant model, the matching probability between the predicted keyword vectors and the real keyword vectors, the following is included:
    if the matching probability is less than the preset threshold, calculating respective loss functions of the latest generation model and the latest discriminant model according to the matching probability;
    optimizing respective model parameters of the latest generation model and the latest discriminant model according to the respective loss functions of the latest generation model and the latest discriminant model, so as to obtain a latest generation model and a latest discriminant model with optimized and updated model parameters;
    performing the step of inputting the training word vector group into the latest generation model.
  17. The text keyword extraction device according to claim 13, wherein before the step of converting the text to be extracted into the corresponding word vector group according to the preset word vector library, the following is included:
    obtaining corpus text to be converted, and performing word segmentation on the corpus text to obtain a word group to be converted after segmentation;
    converting each word to be converted in the word group to be converted into a corresponding word vector, and storing each word to be converted in association with its corresponding word vector in the preset word vector library.
  18. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps are implemented:
    obtaining a text to be extracted, and converting the text to be extracted into a corresponding word vector group according to a preset word vector library;
    extracting a target keyword vector from the word vector group according to a preset optimal generation model;
    converting the target keyword vector into a corresponding target keyword according to the preset word vector library, and extracting the target keyword as a text keyword of the text to be extracted.
  19. The computer-readable storage medium according to claim 18, wherein the step of converting the target keyword vector into the corresponding target keyword according to the preset word vector library comprises:
    traversing all preset word vectors in the preset word vector library, and calculating the Euclidean distance between each preset word vector and the target keyword vector;
    obtaining, from all the preset word vectors, a matching word vector having the smallest Euclidean distance to the target keyword vector, and obtaining a matching word corresponding to the matching word vector from the preset word vector library, the matching word being the target keyword.
  20. The computer-readable storage medium according to claim 18, wherein, before the step of extracting the target keyword vector from the word vector group according to the preset optimal generation model, the following steps are further included:
    converting a preset training text into a corresponding training word vector group according to the preset word vector library, and obtaining a real keyword vector in the training word vector group;
    inputting the training word vector group into the latest generation model, and outputting, by the latest generation model, a predicted keyword vector extracted from the training word vector group;
    inputting the real keyword vector and the predicted keyword vector into the latest discriminant model, and outputting, by the latest discriminant model, the matching probability between the predicted keyword vector and the real keyword vector;
    if the matching probability is greater than a preset threshold, the latest generation model is the preset optimal generation model.
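  Claims 16 and 20 together describe an adversarial training loop: updates continue while the discriminant model's matching probability stays at or below the preset threshold, and the latest generation model is taken as the preset optimal generation model once the threshold is exceeded. The sketch below shows only that control flow; the per-iteration update is the one sketched under claim 16, and the threshold value, iteration cap, and model interfaces are illustrative assumptions.

    # Control-flow sketch of the claim 20 stopping criterion (values illustrative).
    import torch

    PRESET_THRESHOLD = 0.5
    MAX_ITERATIONS = 10_000          # safety cap, not part of the claims

    def train_until_optimal(generator, discriminator, adversarial_step,
                            training_word_vecs, real_keyword_vecs):
        for _ in range(MAX_ITERATIONS):
            # One generation/discriminant model update as in the claim 16 sketch.
            adversarial_step(generator, discriminator, training_word_vecs, real_keyword_vecs)
            with torch.no_grad():
                predicted = generator(training_word_vecs)
                matching_prob = discriminator(predicted).mean().item()
            if matching_prob > PRESET_THRESHOLD:
                return generator     # latest generation model becomes the preset optimal model
        return generator             # fall back after the safety cap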
PCT/CN2018/122813 2018-10-25 2018-12-21 Method, apparatus and device for extracting text keyword, as well as computer readable storage medium WO2020082560A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811254895.6A CN109635273B (en) 2018-10-25 2018-10-25 Text keyword extraction method, device, equipment and storage medium
CN201811254895.6 2018-10-25

Publications (1)

Publication Number Publication Date
WO2020082560A1 (en) 2020-04-30

Family

ID=66066687

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/122813 WO2020082560A1 (en) 2018-10-25 2018-12-21 Method, apparatus and device for extracting text keyword, as well as computer readable storage medium

Country Status (2)

Country Link
CN (1) CN109635273B (en)
WO (1) WO2020082560A1 (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753091A (en) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 Classification method, classification model training method, device, equipment and storage medium
CN111798352A (en) * 2020-05-22 2020-10-20 平安国际智慧城市科技股份有限公司 Enterprise state supervision method, device, equipment and computer readable storage medium
CN112015884A (en) * 2020-08-28 2020-12-01 欧冶云商股份有限公司 Method and device for extracting keywords of user visiting data and storage medium
CN112037912A (en) * 2020-09-09 2020-12-04 平安科技(深圳)有限公司 Triage model training method, device and equipment based on medical knowledge map
CN112100335A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Question generation method, model training method, device, equipment and storage medium
CN112100405A (en) * 2020-09-23 2020-12-18 中国农业大学 Veterinary drug residue knowledge graph construction method based on weighted LDA
CN112735413A (en) * 2020-12-25 2021-04-30 浙江大华技术股份有限公司 Instruction analysis method based on camera device, electronic equipment and storage medium
CN112949906A (en) * 2021-02-04 2021-06-11 杭州品茗安控信息技术股份有限公司 Matching method, device, equipment and storage medium for engineering cost quota conversion
CN113051372A (en) * 2021-04-12 2021-06-29 平安国际智慧城市科技股份有限公司 Material data processing method and device, computer equipment and storage medium
CN113111663A (en) * 2021-04-28 2021-07-13 东南大学 Abstract generation method fusing key information
CN113609292A (en) * 2021-08-09 2021-11-05 上海交通大学 Known false news intelligent detection method based on graph structure
CN114491062A (en) * 2021-12-30 2022-05-13 中国科学院计算机网络信息中心 Short text classification method fusing knowledge graph and topic model
CN114706942A (en) * 2022-03-16 2022-07-05 马上消费金融股份有限公司 Text conversion model training method, text conversion device and electronic equipment
CN116167344A (en) * 2023-02-17 2023-05-26 广州市奇之信息技术有限公司 Automatic text generation method for deep learning creative science and technology
CN117009457A (en) * 2023-06-02 2023-11-07 国网江苏省电力有限公司南京供电分公司 Power grid operation and maintenance similar fault determining method, system and storage medium
CN118069791A (en) * 2024-04-22 2024-05-24 菏泽市产品检验检测研究院 Intelligent electronic archive retrieval method and system

Families Citing this family (11)

Publication number Priority date Publication date Assignee Title
CN110362742A (en) * 2019-06-18 2019-10-22 平安普惠企业管理有限公司 Curriculum information matching process, device, computer equipment and storage medium
CN110378563A (en) * 2019-06-18 2019-10-25 平安普惠企业管理有限公司 Information processing method, device, computer equipment and storage medium
CN112307199A (en) * 2019-07-14 2021-02-02 阿里巴巴集团控股有限公司 Information identification method, data processing method, device and equipment, information interaction method
CN110765767B (en) * 2019-09-19 2024-01-19 平安科技(深圳)有限公司 Extraction method, device, server and storage medium of local optimization keywords
CN111191689B (en) * 2019-12-16 2023-09-12 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN111325641B (en) * 2020-02-18 2023-08-29 北京百度网讯科技有限公司 Method and device for determining recommended criminal investigation range, electronic equipment and medium
CN112328655B (en) * 2020-11-02 2024-05-24 中国平安人寿保险股份有限公司 Text label mining method, device, equipment and storage medium
CN112699675B (en) * 2020-12-30 2023-09-12 平安科技(深圳)有限公司 Text processing method, device, equipment and computer readable storage medium
CN112818688B (en) * 2021-04-16 2021-06-25 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN113240562A (en) * 2021-05-27 2021-08-10 南通大学 Method and system for recommending and matching obstetrical and academic research projects based on nlp
CN113283235B (en) * 2021-07-21 2021-11-19 明品云(北京)数据科技有限公司 User label prediction method and system

Citations (4)

Publication number Priority date Publication date Assignee Title
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN108133045A (en) * 2018-01-12 2018-06-08 广州杰赛科技股份有限公司 Keyword extracting method and system, keyword extraction model generating method and system
CN108304364A (en) * 2017-02-23 2018-07-20 腾讯科技(深圳)有限公司 keyword extracting method and device
WO2018183306A1 (en) * 2017-03-29 2018-10-04 Ebay Inc. Generating keywords by associative context with input

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
JP3441400B2 (en) * 1998-06-04 2003-09-02 松下電器産業株式会社 Language conversion rule creation device and program recording medium
US7093137B1 (en) * 1999-09-30 2006-08-15 Casio Computer Co., Ltd. Database management apparatus and encrypting/decrypting system
US9953632B2 (en) * 2014-04-17 2018-04-24 Qualcomm Incorporated Keyword model generation for detecting user-defined keyword
CN106021272B (en) * 2016-04-04 2019-11-19 上海大学 The keyword extraction method calculated based on distributed expression term vector
CN105930318B (en) * 2016-04-11 2018-10-19 深圳大学 A kind of term vector training method and system
CN106803082A (en) * 2017-01-23 2017-06-06 重庆邮电大学 A kind of online handwriting recognition methods based on conditional generation confrontation network
CN107168954B (en) * 2017-05-18 2021-03-26 北京奇艺世纪科技有限公司 Text keyword generation method and device, electronic equipment and readable storage medium
CN107330444A (en) * 2017-05-27 2017-11-07 苏州科技大学 A kind of image autotext mask method based on generation confrontation network
CN108197525B (en) * 2017-11-20 2020-08-11 中国科学院自动化研究所 Face image generation method and device
CN108563624A (en) * 2018-01-03 2018-09-21 清华大学深圳研究生院 A kind of spatial term method based on deep learning
CN108319668B (en) * 2018-01-23 2021-04-20 义语智能科技(上海)有限公司 Method and equipment for generating text abstract
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text
CN108460104B (en) * 2018-02-06 2021-06-18 北京奇虎科技有限公司 Method and device for customizing content
CN108446334B (en) * 2018-02-23 2021-08-03 浙江工业大学 Image retrieval method based on content for unsupervised countermeasure training
CN108491497B (en) * 2018-03-20 2020-06-02 苏州大学 Medical text generation method based on generation type confrontation network technology

Patent Citations (4)

Publication number Priority date Publication date Assignee Title
CN108304364A (en) * 2017-02-23 2018-07-20 腾讯科技(深圳)有限公司 keyword extracting method and device
WO2018183306A1 (en) * 2017-03-29 2018-10-04 Ebay Inc. Generating keywords by associative context with input
CN107704503A (en) * 2017-08-29 2018-02-16 平安科技(深圳)有限公司 User's keyword extracting device, method and computer-readable recording medium
CN108133045A (en) * 2018-01-12 2018-06-08 广州杰赛科技股份有限公司 Keyword extracting method and system, keyword extraction model generating method and system

Cited By (26)

Publication number Priority date Publication date Assignee Title
CN111798352A (en) * 2020-05-22 2020-10-20 平安国际智慧城市科技股份有限公司 Enterprise state supervision method, device, equipment and computer readable storage medium
CN111753091A (en) * 2020-06-30 2020-10-09 北京小米松果电子有限公司 Classification method, classification model training method, device, equipment and storage medium
CN112015884A (en) * 2020-08-28 2020-12-01 欧冶云商股份有限公司 Method and device for extracting keywords of user visiting data and storage medium
CN112037912A (en) * 2020-09-09 2020-12-04 平安科技(深圳)有限公司 Triage model training method, device and equipment based on medical knowledge map
CN112037912B (en) * 2020-09-09 2023-07-11 平安科技(深圳)有限公司 Triage model training method, device and equipment based on medical knowledge graph
CN112100405B (en) * 2020-09-23 2024-01-30 中国农业大学 Veterinary drug residue knowledge graph construction method based on weighted LDA
CN112100405A (en) * 2020-09-23 2020-12-18 中国农业大学 Veterinary drug residue knowledge graph construction method based on weighted LDA
CN112100335A (en) * 2020-09-25 2020-12-18 北京百度网讯科技有限公司 Question generation method, model training method, device, equipment and storage medium
CN112100335B (en) * 2020-09-25 2024-05-03 北京百度网讯科技有限公司 Problem generation method, model training method, device, equipment and storage medium
CN112735413A (en) * 2020-12-25 2021-04-30 浙江大华技术股份有限公司 Instruction analysis method based on camera device, electronic equipment and storage medium
CN112735413B (en) * 2020-12-25 2024-05-31 浙江大华技术股份有限公司 Instruction analysis method based on camera device, electronic equipment and storage medium
CN112949906B (en) * 2021-02-04 2024-03-19 品茗科技股份有限公司 Matching method, device, equipment and storage medium for engineering cost quota conversion
CN112949906A (en) * 2021-02-04 2021-06-11 杭州品茗安控信息技术股份有限公司 Matching method, device, equipment and storage medium for engineering cost quota conversion
CN113051372A (en) * 2021-04-12 2021-06-29 平安国际智慧城市科技股份有限公司 Material data processing method and device, computer equipment and storage medium
CN113051372B (en) * 2021-04-12 2024-05-07 平安国际智慧城市科技股份有限公司 Material data processing method, device, computer equipment and storage medium
CN113111663A (en) * 2021-04-28 2021-07-13 东南大学 Abstract generation method fusing key information
CN113609292A (en) * 2021-08-09 2021-11-05 上海交通大学 Known false news intelligent detection method based on graph structure
CN113609292B (en) * 2021-08-09 2023-10-13 上海交通大学 Known false news intelligent detection method based on graph structure
CN114491062B (en) * 2021-12-30 2024-05-03 中国科学院计算机网络信息中心 Short text classification method integrating knowledge graph and topic model
CN114491062A (en) * 2021-12-30 2022-05-13 中国科学院计算机网络信息中心 Short text classification method fusing knowledge graph and topic model
CN114706942A (en) * 2022-03-16 2022-07-05 马上消费金融股份有限公司 Text conversion model training method, text conversion device and electronic equipment
CN114706942B (en) * 2022-03-16 2023-11-24 马上消费金融股份有限公司 Text conversion model training method, text conversion device and electronic equipment
CN116167344B (en) * 2023-02-17 2023-10-27 广州市奇之信息技术有限公司 Automatic text generation method for deep learning creative science and technology
CN116167344A (en) * 2023-02-17 2023-05-26 广州市奇之信息技术有限公司 Automatic text generation method for deep learning creative science and technology
CN117009457A (en) * 2023-06-02 2023-11-07 国网江苏省电力有限公司南京供电分公司 Power grid operation and maintenance similar fault determining method, system and storage medium
CN118069791A (en) * 2024-04-22 2024-05-24 菏泽市产品检验检测研究院 Intelligent electronic archive retrieval method and system

Also Published As

Publication number Publication date
CN109635273B (en) 2023-04-25
CN109635273A (en) 2019-04-16

Similar Documents

Publication Publication Date Title
WO2020082560A1 (en) Method, apparatus and device for extracting text keyword, as well as computer readable storage medium
CN107491534B (en) Information processing method and device
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN107066464B (en) Semantic natural language vector space
US10606946B2 (en) Learning word embedding using morphological knowledge
US20180336193A1 (en) Artificial Intelligence Based Method and Apparatus for Generating Article
US20210150142A1 (en) Method and apparatus for determining feature words and server
Sun et al. Sentiment analysis for Chinese microblog based on deep neural networks with convolutional extension features
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
CN113239700A (en) Text semantic matching device, system, method and storage medium for improving BERT
CN116775847B (en) Question answering method and system based on knowledge graph and large language model
CN109344240B (en) Data processing method, server and electronic equipment
WO2015135455A1 (en) Natural language question answering method and apparatus
CN110457708B (en) Vocabulary mining method and device based on artificial intelligence, server and storage medium
KR102491172B1 (en) Natural language question-answering system and learning method
CN111931500B (en) Search information processing method and device
WO2020253042A1 (en) Intelligent sentiment judgment method and device, and computer readable storage medium
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
WO2021139107A1 (en) Intelligent emotion recognition method and apparatus, electronic device, and storage medium
CN110502610A (en) Intelligent sound endorsement method, device and medium based on text semantic similarity
US20200027446A1 (en) Visualization interface for voice input
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
KR101545050B1 (en) Method for automatically classifying answer type and apparatus, question-answering system for using the same
CN114329225A (en) Search method, device, equipment and storage medium based on search statement
CN114003682A (en) Text classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18937866

Country of ref document: EP

Kind code of ref document: A1