CN115827815A - Keyword extraction method and device based on small sample learning

Keyword extraction method and device based on small sample learning

Info

Publication number
CN115827815A
Authority
CN
China
Prior art keywords
text data
model
keyword
keyword list
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211459952.0A
Other languages
Chinese (zh)
Other versions
CN115827815B (en)
Inventor
马晓亮
安玲玲
朱栩
陈茂强
邓从健
杜德泉
黄建文
古风云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yunqu Information Technology Co ltd
Guangzhou Institute of Technology of Xidian University
Original Assignee
Guangzhou Yunqu Information Technology Co ltd
Guangzhou Institute of Technology of Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yunqu Information Technology Co ltd, Guangzhou Institute of Technology of Xidian University filed Critical Guangzhou Yunqu Information Technology Co ltd
Priority to CN202211459952.0A priority Critical patent/CN115827815B/en
Publication of CN115827815A publication Critical patent/CN115827815A/en
Application granted granted Critical
Publication of CN115827815B publication Critical patent/CN115827815B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a keyword extraction method and device based on small sample learning. The keyword extraction method based on small sample learning comprises the following steps: acquiring first text data; inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length; inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data; inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data; judging whether the first keyword list and the second keyword list are the same; if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list. The method and device can improve the accuracy of keyword extraction.

Description

Keyword extraction method and device based on small sample learning
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a keyword extraction method and device based on small sample learning.
Background
At present, a large number of keyword extraction models exist for the general domain, mainly divided into two types: unsupervised and supervised. However, in subdivided fields of the communication industry, the volume of customer service communication data is huge, the cost of constructing a corpus rises accordingly, and subsequent correction and updating become harder, so the conventional keyword extraction models face several problems: (1) Unsupervised keyword extraction suffers from low accuracy, weak association between the extracted topic words and the document, and poor interpretation of the document topic; supervised methods improve extraction accuracy but require a large amount of high-quality manually labeled corpora, and building a large-scale, high-quality labeled training set for a subdivided field requires an impractical amount of labor. (2) Customers' wording is unpredictable, and dialect is often mixed into their dialect or Mandarin speech. (3) Training cost is high: training on large-scale samples consumes a large amount of training time and computing power.
That is, the keyword extraction method in the prior art has low accuracy.
Disclosure of Invention
The application aims to provide a keyword extraction method and device based on small sample learning, so as to solve the problem that keyword extraction methods in the prior art have low accuracy.
On one hand, the application provides a keyword extraction method based on small sample learning, and the keyword extraction method based on small sample learning comprises the following steps:
acquiring first text data;
inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length;
inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data;
inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data;
judging whether the first keyword list and the second keyword list are the same;
if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as an extracted target keyword list.
Further, the acquiring the first text data includes:
acquiring communication conversation voice data;
carrying out voice recognition on the communication conversation voice data to obtain second text data;
the first text data is determined from the second text data.
Further, the determining the first text data according to the second text data includes:
judging the dialect type of the second text data according to a preset dialect dictionary;
determining a corresponding dialect Mandarin mapping dictionary according to the dialect type of the second text data;
mapping the second text data to third text data of a Mandarin type according to the dialect Mandarin mapping dictionary;
the first text data is determined from the third text data.
Further, the determining the first text data according to the third text data includes:
merging synonyms in the third text data to obtain synonym merged text data;
and inputting the synonym merged text data into a target abstract generation model for dimensionality reduction to obtain first text data.
Further, the target abstract generation model comprises an encoder sub-model and a decoder sub-model, and the network layer of the encoder sub-model comprises a multi-head self-attention mechanism sub-network layer and a fully-connected feedforward sub-network layer; the network layers of the decoder submodel include a multi-headed self-attention mechanism sub-network layer, an attention sub-network layer, and a fully connected feed-forward sub-network layer.
Further, the inputting the first text data into the BiLSTM model to obtain a text sequence with a preset length includes:
removing preset modal particles and preset meaningless words from the first text data to obtain removed text data;
inputting the removed text data into a pre-trained character vector model, reading characters of the removed text data, and acquiring a character vector list;
and inputting the character vector list into the BiLSTM model to obtain a text sequence with a preset length.
Further, before the acquiring of the first text data, the method includes:
performing text preprocessing on the training text set to obtain preprocessed text data;
performing preset modal particle removal, feature word extraction, and repair labeling on the preprocessed text data to obtain a target training vector and corresponding target label data;
and training the BiLSTM-CRF model based on the target training vector and the corresponding target label data.
In one aspect, the present application provides a keyword extraction device based on small sample learning, the keyword extraction device based on small sample learning includes:
an acquisition unit configured to acquire first text data;
the text acquisition unit is used for inputting the first text data into the BiLSTM model to obtain a text sequence with a preset length;
the first keyword extraction unit is used for inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data;
the second keyword extraction unit is used for inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data;
a judging unit configured to judge whether the first keyword list and the second keyword list are the same;
a determining unit, configured to update the BiLSTM model and the BiLSTM-CRF model by using the first text data if the first keyword list and the second keyword list are different; and if the first keyword list and the second keyword list are the same, determine the first keyword list of the first text data as an extracted target keyword list.
Further, the obtaining unit is configured to:
acquiring communication conversation voice data;
carrying out voice recognition on the communication conversation voice data to obtain second text data;
the first text data is determined from the second text data.
Further, the obtaining unit is configured to:
judging the dialect type of the second text data according to a preset dialect dictionary;
determining a corresponding dialect Mandarin mapping dictionary according to the dialect type of the second text data;
mapping the second text data to third text data of a Mandarin type according to the dialect Mandarin mapping dictionary;
the first text data is determined from the third text data.
Further, the obtaining unit is configured to:
merging synonyms in the third text data to obtain synonym merged text data;
and inputting the synonym merged text data into a target abstract generation model for dimensionality reduction to obtain first text data.
Further, the target abstract generation model comprises an encoder sub-model and a decoder sub-model, and the network layer of the encoder sub-model comprises a multi-head self-attention mechanism sub-network layer and a fully-connected feedforward sub-network layer; the network layers of the decoder submodel include a multi-headed self-attention mechanism sub-network layer, an attention sub-network layer, and a fully connected feed-forward sub-network layer.
Further, the text acquisition unit is configured to:
removing preset modal particles and preset meaningless words from the first text data to obtain removed text data;
inputting the removed text data into a pre-trained character vector model, reading characters of the removed text data, and acquiring a character vector list;
and inputting the character vector list into the BiLSTM model to obtain a text sequence with a preset length.
Further, the obtaining unit is configured to:
performing text preprocessing on the training text set to obtain preprocessed text data;
performing preset modal particle removal, feature word extraction, and repair labeling on the preprocessed text data to obtain a target training vector and corresponding target label data;
and training the BiLSTM-CRF model based on the target training vector and the corresponding target label data.
In one aspect, the present application further provides an electronic device, including:
one or more processors;
a memory; and
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the small sample learning based keyword extraction method of any of the first aspects.
In one aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is loaded by a processor to execute the steps in the small sample learning-based keyword extraction method according to any one of the first aspect.
The application provides a keyword extraction method based on small sample learning, which comprises the following steps: acquiring first text data; inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length; inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data; inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data; judging whether the first keyword list and the second keyword list are the same; if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list. The method and the device use the unmatched keywords for small sample correction training, improving overall accuracy while saving computing power.
Further, to address the dependence of existing keyword extraction methods on large amounts of manual labeling, the invention combines a deep-neural-network-based information extraction technique with a synonym merging technique, and provides an automatic keyword extraction method for subdivided fields.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene schematic diagram of a keyword extraction system based on small sample learning according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an embodiment of a keyword extraction method based on small sample learning according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of a small sample learning-based keyword extraction apparatus provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the present application, it is to be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, merely for convenience of description and simplicity of description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be considered limiting of the present application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "exemplary" is used to mean "serving as an example, instance, or illustration". Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The following description is presented to enable any person skilled in the art to make and use the application. In the following description, details are set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not set forth in detail in order to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
It should be noted that, since the method in the embodiment of the present application is executed in the electronic device, the processing objects of each electronic device all exist in the form of data or information, for example, time, which is substantially time information, and it is understood that, if the size, the number, the position, and the like are mentioned in the following embodiments, all corresponding data exist so as to be processed by the electronic device, and details are not described herein.
The embodiment of the application provides a keyword extraction method and device based on small sample learning, which are described in detail below.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a small sample learning-based keyword extraction system according to an embodiment of the present disclosure, where the small sample learning-based keyword extraction system may include an electronic device 100, and a small sample learning-based keyword extraction apparatus, such as the electronic device in fig. 1, is integrated in the electronic device 100.
In this embodiment of the application, the electronic device 100 may be an independent server, or may be a server network or a server cluster composed of servers, for example, the electronic device 100 described in this embodiment of the application includes, but is not limited to, a computer, a network host, a single network server, multiple network server sets, or a cloud server composed of multiple servers. Among them, the Cloud server is constituted by a large number of computers or web servers based on Cloud Computing (Cloud Computing).
Those skilled in the art will understand that the application environment shown in fig. 1 is only one application scenario of the present application scheme, and does not constitute a limitation on the application scenario of the present application scheme, and that other application environments may further include more or fewer electronic devices than those shown in fig. 1, for example, only 1 electronic device is shown in fig. 1, and it is understood that the keyword extraction system based on small sample learning may further include one or more other servers, which is not limited herein.
In addition, as shown in fig. 1, the keyword extraction system based on small sample learning may further include a memory 200 for storing data.
It should be noted that the scenario diagram of the keyword extraction system based on small sample learning shown in fig. 1 is only an example, and the keyword extraction system based on small sample learning and the scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application.
First, an embodiment of the present application provides a small sample learning-based keyword extraction method, where the execution subject of the small sample learning-based keyword extraction method is a small sample learning-based keyword extraction device, and the small sample learning-based keyword extraction device is applied to an electronic device. The small sample learning-based keyword extraction method includes: acquiring first text data; inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length; inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data; inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data; judging whether the first keyword list and the second keyword list are the same; if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list.
Referring to fig. 2, fig. 2 is a schematic flowchart of an embodiment of a keyword extraction method based on small sample learning according to an embodiment of the present disclosure. The keyword extraction method based on small sample learning comprises the following steps:
s201, acquiring first text data.
In this embodiment of the application, before acquiring the first text data, the method includes:
and performing text preprocessing on the training text set to obtain preprocessed text data.
In the embodiment of the application, the training text set is obtained by performing automatic speech recognition (ASR) on communication dialogue voice data from a historical time period.
Because the customer complaint text data set contains a large number of dialect expressions (such as Cantonese) and colloquial expressions, the contents expressed in dialect need to be mapped into Mandarin with the same meaning, and words with different expressions but the same meaning need to be mapped onto the same word for uniform training.
Specifically, the dialect type of the training text set is first judged according to a preset dialect dictionary; a corresponding dialect Mandarin mapping dictionary is determined according to the dialect type of the training text set; and the training text set is mapped to a Mandarin text set of the Mandarin type according to the dialect Mandarin mapping dictionary.
Because the original text data volume is large, text dimensionality reduction is needed to reduce the amount of computation. Specifically, the number of tokens in the text is first reduced through multi-layer text convolution and a T5 model fused with the MLM (masked language model) algorithm. Synonyms in the Mandarin text set are then merged to obtain synonym-merged text data; that is, words with different expressions but the same meaning are mapped onto the same word for uniform training. Finally, the synonym-merged text data is input into a target abstract generation model for dimensionality reduction to obtain the preprocessed text data.
The target abstract generation model comprises an encoder submodel and a decoder submodel, wherein the network layer of the encoder submodel comprises a multi-head self-attention mechanism sub-network layer and a fully-connected feedforward sub-network layer; the network layers of the decoder submodel include a multi-headed self-attention mechanism sub-network layer, an attention sub-network layer, and a fully connected feed-forward sub-network layer.
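The layer layout described above matches the standard Transformer encoder-decoder structure. The following is a minimal PyTorch sketch of that structure, for illustration only: the hyperparameters (d_model=512, nhead=8, 6 layers) are assumptions, not values given in this application.

```python
import torch
import torch.nn as nn

# Encoder layers: multi-head self-attention + fully connected feed-forward.
# Decoder layers: multi-head self-attention + (cross-)attention over the
# encoder output + fully connected feed-forward, as described above.
d_model, nhead, num_layers = 512, 8, 6  # illustrative assumptions

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True),
    num_layers=num_layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True),
    num_layers=num_layers)

src = torch.randn(2, 128, d_model)  # embedded input text (batch, seq, dim)
tgt = torch.randn(2, 32, d_model)   # embedded summary prefix
memory = encoder(src)               # encoder sub-model output
out = decoder(tgt, memory)          # decoder attends to its prefix and to memory
print(out.shape)                    # torch.Size([2, 32, 512])
```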
Because customers' expressions are often colloquial and non-specialized, with deletions, omissions and the like, the models cannot identify accurate words; therefore, the original words expressed by customers need to be converted into standard feature words.
Specifically, preset modal particles and preset meaningless words are removed, feature words are extracted, and repair labeling is performed on the preprocessed text data to obtain target training vectors and corresponding target label data, and the BiLSTM-CRF model is trained based on the target training vectors and the corresponding target label data. First, a stop-word lexicon is constructed, and preset meaningless words in the preprocessed text data, such as preset modal particles and forms of address, are removed. Feature words are then extracted based on rules: keywords such as 'company' and 'package' are identified to locate the feature text. For noise reduction, service personnel sample each text and perform repair labeling. Corpus construction requires processing the feature words and the original text data into target training vectors and target label data: entities in the original text are labeled according to the manually repaired feature words, and non-entities are labeled as 'other', completing the training corpus. Then, the characters of each original noun in the preprocessed text data are read against the pre-trained character vector model to obtain a character vector list, which is input into the model as its initialization values. Finally, model training and prediction use a sequence labeling model implemented on the basis of BiLSTM-CRF, trained with the target training vectors and target labels as input.
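A minimal sketch of such a BiLSTM-CRF sequence labeling model follows, assuming PyTorch and the third-party pytorch-crf package; the dimensions and helper names are illustrative assumptions rather than details from this application.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed third-party package)

class BiLSTMCRF(nn.Module):
    """Character embeddings -> BiLSTM -> per-tag emissions -> CRF layer."""
    def __init__(self, vocab_size, num_tags, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            bidirectional=True, batch_first=True)
        self.emit = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, chars, tags, mask):
        # Negative log-likelihood of the gold tag sequences under the CRF.
        emissions = self.emit(self.lstm(self.embed(chars))[0])
        return -self.crf(emissions, tags, mask=mask)

    def decode(self, chars, mask):
        # Best-scoring tag sequence per sentence (Viterbi decoding).
        emissions = self.emit(self.lstm(self.embed(chars))[0])
        return self.crf.decode(emissions, mask=mask)
```

In training, the target training vectors serve as `chars` and the target label data as `tags`; entity tags mark the feature words and an 'other' tag marks non-entities.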
In an embodiment of the present application, acquiring first text data includes:
(1) Communication session voice data is acquired.
For example, the communication session voice data is communication session voice data generated when a customer service performs a voice call.
(2) And carrying out voice recognition on the communication conversation voice data to obtain second text data.
And performing ASR on the communication dialogue voice data to obtain second text data.
(3) The first text data is determined from the second text data.
In a specific embodiment, the second text data is determined as the first text data.
In another specific embodiment, determining the first text data from the second text data may include:
(1) And judging the dialect type of the second text data according to the preset dialect dictionary.
Because the customer complaint text data set contains a large number of dialect expressions (such as Cantonese) and colloquial expressions, the contents expressed in dialect need to be mapped into Mandarin with the same meaning, and words with different expressions but the same meaning need to be mapped onto the same word for uniform training.
Specifically, the dialect type of the second text data is first judged from a preset dialect dictionary.
(2) And determining a corresponding dialect Mandarin mapping dictionary according to the dialect type of the second text data.
(3) The second text data is mapped to third text data of a mandarin type according to a dialect mandarin mapping dictionary.
Specifically, the second text data is mapped to the Mandarin type according to the dialect Mandarin mapping dictionary. The dialect Mandarin mapping dictionary maps dialect vocabulary to Mandarin; dialect words in the second text data are replaced with Mandarin to obtain the third text data of the Mandarin type (a code sketch of this mapping is given below).
(4) The first text data is determined from the third text data.
In a specific embodiment, the third text data is determined as the first text data.
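As referenced above, here is a minimal sketch of the dialect detection and dialect-to-Mandarin mapping, assuming simple in-memory dictionaries; the example Cantonese entries and the longest-match-first replacement strategy are illustrative assumptions.

```python
from typing import Optional

# Preset dialect dictionaries (words that identify each dialect) and the
# corresponding dialect-to-Mandarin mapping dictionaries. Entries are
# illustrative assumptions, not taken from this application.
DIALECT_DICTS = {"cantonese": {"唔该", "点解", "几多"}}
DIALECT_TO_MANDARIN = {"cantonese": {"唔该": "谢谢", "点解": "为什么", "几多": "多少"}}

def detect_dialect(text: str) -> Optional[str]:
    """Judge the dialect type by counting hits against each preset dialect dictionary."""
    best, best_hits = None, 0
    for dialect, words in DIALECT_DICTS.items():
        hits = sum(text.count(w) for w in words)
        if hits > best_hits:
            best, best_hits = dialect, hits
    return best

def map_to_mandarin(text: str) -> str:
    """Replace dialect vocabulary with Mandarin words of the same meaning."""
    dialect = detect_dialect(text)
    if dialect is None:
        return text
    mapping = DIALECT_TO_MANDARIN[dialect]
    for word in sorted(mapping, key=len, reverse=True):  # longest match first
        text = text.replace(word, mapping[word])
    return text

print(map_to_mandarin("唔该帮我查下点解扣费"))  # -> 谢谢帮我查下为什么扣费
```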
In another specific embodiment, determining the first text data from the third text data may include:
(1) And merging the synonyms in the third text data to obtain synonym merged text data.
Specifically, words with different expression modes but the same meaning are mapped to the same word for uniform training.
(2) And inputting the synonym merged text data into a target abstract generation model for dimensionality reduction to obtain first text data.
Because the original text data volume is large, text dimensionality reduction is needed to reduce the amount of computation. Specifically, the number of tokens in the third text data is first reduced through multi-layer text convolution and a T5 model fused with the MLM (masked language model) algorithm. After synonym merging, the third text data is input into the target abstract generation model for dimensionality reduction to obtain the first text data.
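A minimal sketch of the synonym merging and summarization-based dimensionality reduction follows, assuming the Hugging Face transformers library; the synonym map entries and the checkpoint path are hypothetical placeholders, not names given in this application.

```python
from transformers import pipeline  # assumes Hugging Face transformers is installed

# Illustrative synonym map: different expressions with the same meaning are
# mapped onto one canonical word before training.
SYNONYMS = {"话费套餐": "套餐", "流量包": "套餐", "月租费": "资费"}

def merge_synonyms(text: str) -> str:
    for variant in sorted(SYNONYMS, key=len, reverse=True):  # longest match first
        text = text.replace(variant, SYNONYMS[variant])
    return text

# "path/to/chinese-t5-summarization" is a hypothetical checkpoint path; any
# seq2seq summarization model (e.g. a Chinese T5) could stand in here.
summarizer = pipeline("summarization", model="path/to/chinese-t5-summarization")

def reduce_text(text: str, max_length: int = 64) -> str:
    """Merge synonyms, then compress the text via abstractive summarization."""
    merged = merge_synonyms(text)
    return summarizer(merged, max_length=max_length, min_length=8)[0]["summary_text"]
```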
S202, inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length.
Specifically, inputting the first text data into the BiLSTM model to obtain a text sequence with a preset length includes the following steps (a minimal sketch follows the list):
(1) Removing the preset modal particles and preset meaningless words from the first text data to obtain removed text data.
(2) Inputting the removed text data into a pre-trained character vector model, reading the characters of the removed text data, and obtaining a character vector list.
(3) Inputting the character vector list into the BiLSTM model to obtain a text sequence with a preset length.
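The following sketch illustrates steps (1) to (3) under stated assumptions: the modal-particle list, the toy character vocabulary, the embedding size, and the preset length of 50 are all illustrative, and a randomly initialized embedding stands in for the pre-trained character vector model.

```python
import torch
import torch.nn as nn

REMOVE_WORDS = {"嗯", "啊", "那个", "就是说"}  # preset modal/meaningless words (assumed)
char2id = {"<pad>": 0, "套": 1, "餐": 2, "扣": 3, "费": 4}  # toy vocabulary
PRESET_LEN = 50
embed = nn.Embedding(len(char2id), 128, padding_idx=0)  # stand-in for pre-trained vectors

def to_char_vectors(text: str) -> torch.Tensor:
    """Steps (1) and (2): remove preset words, then look up character vectors."""
    for w in REMOVE_WORDS:
        text = text.replace(w, "")
    ids = [char2id.get(ch, 0) for ch in text][:PRESET_LEN]
    ids += [0] * (PRESET_LEN - len(ids))  # pad to the preset length
    return embed(torch.tensor([ids]))     # (1, PRESET_LEN, 128)

# Step (3): the BiLSTM turns the character vector list into a text sequence
# of the preset length (one 256-dim state per position).
bilstm = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
sequence, _ = bilstm(to_char_vectors("嗯那个套餐扣费"))
print(sequence.shape)  # torch.Size([1, 50, 256])
```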
S203, inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data.
S204, inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data.
Keywords are words that express the central content of a document; in computer systems they are commonly used to index the content features of articles, to support information retrieval, and to be systematically collected for readers to consult. Keyword extraction is a branch of the text mining field, and is fundamental work for text mining research such as text retrieval, document comparison, abstract generation, and document classification and clustering.
From the perspective of the algorithm, the keyword extraction algorithm mainly has two categories: unsupervised keyword extraction method and supervised keyword extraction method.
The unsupervised keyword extraction method does not need a manually labeled corpus; it uses certain methods to find the more important words in the text as keywords. The process extracts candidate words, scores each candidate word, and outputs the top-K highest-scoring candidate words as the keywords. Depending on the scoring strategy, there are different algorithms, such as TF-IDF, TextRank, and LDA.
There are three main types of unsupervised keyword extraction methods: keyword extraction based on statistical features (TF, TF-IDF); keyword extraction based on word graph models (PageRank, TextRank); and keyword extraction based on topic models (LDA).
The supervised keyword extraction method treats keyword extraction as a binary classification problem: candidate words are extracted, each candidate word is given a label indicating whether or not it is a keyword, and a keyword extraction classifier is trained. When a new document arrives, all its candidate words are extracted and classified with the trained classifier, and the candidate words labeled as keywords are taken as the keywords.
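To make the unsupervised branch described above concrete, here is a minimal TF-IDF top-K sketch assuming scikit-learn and pre-segmented (space-joined) Chinese text; TextRank or LDA could replace the scoring step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(segmented_docs, doc_index=0, top_k=5):
    """Score candidate words by TF-IDF and return the top-K for one document."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(segmented_docs)
    vocab = vectorizer.get_feature_names_out()
    row = tfidf[doc_index].toarray().ravel()
    return [vocab[i] for i in row.argsort()[::-1][:top_k] if row[i] > 0]

docs = ["套餐 扣费 异常 投诉", "宽带 安装 预约 时间", "套餐 变更 资费 咨询"]
print(tfidf_keywords(docs, doc_index=0, top_k=3))
```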
S205, judging whether the first keyword list and the second keyword list are the same.
S206, if the first keyword list and the second keyword list are different, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list.
After the first keyword list is extracted, it is compared with the second keyword list extracted by a single deep neural network, and a matched target keyword list is used for subsequent engineering applications. However, because customer service business and scenarios change, occasional customer service events occur and produce a small number of unmatched keywords. The first text data corresponding to these keywords is fed back into the model, the BiLSTM model and the BiLSTM-CRF model are updated, small sample correction training is performed, and the keyword list is quickly optimized.
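The compare-and-update logic of S205/S206 can be sketched as below; `crf_pipeline`, `preset_model`, and `fine_tune_on` are hypothetical helpers standing in for the BiLSTM/BiLSTM-CRF pipeline, the preset extraction model, and the small sample correction training, and re-extracting after the update is an assumption about how the loop closes.

```python
def extract_target_keywords(first_text_data, crf_pipeline, preset_model):
    first_list = crf_pipeline.extract(first_text_data)   # BiLSTM -> BiLSTM-CRF (S202-S203)
    second_list = preset_model.extract(first_text_data)  # preset extraction model (S204)

    if set(first_list) == set(second_list):              # S205
        return first_list                                # matched: target keyword list

    # S206: unmatched keywords trigger small sample correction training on
    # this sample; the (assumed) loop then extracts again with updated models.
    crf_pipeline.fine_tune_on([first_text_data])
    return crf_pipeline.extract(first_text_data)
```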
Compared with existing keyword extraction methods, the method improves the accuracy of keyword extraction and does not depend on a large manually constructed corpus for the subdivided field. Compared with prior research, the invention improves on the baseline model by 24.18%, 57.40%, and 42.51% on the three indicators of accuracy, recall, and F-value, respectively.
In order to better implement the method for extracting a keyword based on small sample learning in the embodiment of the present application, on the basis of the method for extracting a keyword based on small sample learning, an embodiment of the present application further provides a device for extracting a keyword based on small sample learning, as shown in fig. 3, fig. 3 is a schematic structural diagram of an embodiment of the device for extracting a keyword based on small sample learning provided in the embodiment of the present application, and the device 400 for extracting a keyword based on small sample learning includes:
an acquisition unit 401 configured to acquire first text data;
a text acquisition unit 402, configured to input the first text data into a BiLSTM model to obtain a text sequence with a preset length;
a first keyword extraction unit 403, configured to input the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data;
a second keyword extraction unit 404, configured to input the first text data into a preset keyword extraction model to perform keyword extraction, so as to obtain a second keyword list of the first text data;
a determining unit 405, configured to determine whether the first keyword list and the second keyword list are the same;
a determining unit 406, configured to update a BiLSTM model and a BiLSTM-CRF model by using the first text data if the first keyword list and the second keyword list are different; and if the first keyword list and the second keyword list are the same, determine the first keyword list of the first text data as an extracted target keyword list.
Further, the obtaining unit is configured to:
acquiring communication conversation voice data;
carrying out voice recognition on the communication conversation voice data to obtain second text data;
the first text data is determined from the second text data.
Further, the obtaining unit is configured to:
judging the dialect type of the second text data according to a preset dialect dictionary;
determining a corresponding dialect Mandarin mapping dictionary according to the dialect type of the second text data;
mapping the second text data to third text data of a Mandarin type according to the dialect Mandarin mapping dictionary;
the first text data is determined from the third text data.
Further, the obtaining unit is configured to:
merging synonyms in the third text data to obtain synonym merged text data;
and inputting the synonym merged text data into a target abstract generation model for dimensionality reduction to obtain first text data.
Further, the target abstract generation model comprises an encoder sub-model and a decoder sub-model, and the network layer of the encoder sub-model comprises a multi-head self-attention mechanism sub-network layer and a fully-connected feedforward sub-network layer; the network layers of the decoder submodel include a multi-headed self-attention mechanism sub-network layer, an attention sub-network layer, and a fully connected feed-forward sub-network layer.
Further, the text acquisition unit is configured to:
removing preset modal particles and preset meaningless words from the first text data to obtain removed text data;
inputting the removed text data into a pre-trained character vector model, reading the characters of the removed text data, and acquiring a character vector list;
and inputting the character vector list into the BiLSTM model to obtain a text sequence with a preset length.
Further, the obtaining unit is configured to:
performing text preprocessing on the training text set to obtain preprocessed text data;
performing preset modal particle removal, feature word extraction, and repair labeling on the preprocessed text data to obtain a target training vector and corresponding target label data;
and training the BiLSTM-CRF model based on the target training vector and the corresponding target label data.
The embodiment of the application also provides electronic equipment, which integrates any keyword extraction device based on small sample learning provided by the embodiment of the application. As shown in fig. 4, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, specifically:
the electronic device may include components such as a processor 501 of one or more processing cores, memory 502 of one or more computer-readable storage media, a power supply 503, and an input unit 504. Those skilled in the art will appreciate that the electronic device structures shown in the figures do not constitute limitations on the electronic device, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 501 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 502 and calling data stored in the memory 502, thereby performing overall monitoring of the electronic device. Optionally, processor 501 may include one or more processing cores; preferably, the processor 501 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 501.
The memory 502 may be used to store software programs and modules, and the processor 501 executes various functional applications and data processing by operating the software programs and modules stored in the memory 502. The memory 502 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 502 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 502 may also include a memory controller to provide the processor 501 with access to the memory 502.
The electronic device further comprises a power supply 503 for supplying power to each component, and preferably, the power supply 503 may be logically connected to the processor 501 through a power management system, so that functions of managing charging, discharging, power consumption, and the like are realized through the power management system. The power supply 503 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may also include an input unit 504, where the input unit 504 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 501 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 502 according to the following instructions, and the processor 501 runs the application program stored in the memory 502, so as to implement various functions as follows:
acquiring first text data; inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length; inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data; inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data; judging whether the first keyword list and the second keyword list are the same; if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list. The method and the device can improve the accuracy of the keyword extraction method based on small sample learning.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present application provides a computer-readable storage medium, which may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like. The computer program is loaded by a processor to execute the steps in any of the small sample learning-based keyword extraction methods provided by the embodiments of the present application. For example, the computer program may be loaded by a processor to perform the following steps:
acquiring first text data; inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length; inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data; inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data; judging whether the first keyword list and the second keyword list are the same; if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as the extracted target keyword list. The method and the device can improve the accuracy of the keyword extraction method based on small sample learning.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed descriptions of other embodiments, and are not described herein again.
In specific implementation, each unit or structure may be implemented as an independent entity, or may be combined arbitrarily to be implemented as the same entity or several entities, and specific implementation of each unit or structure may refer to the foregoing method embodiment, which is not described herein again.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The method and the device for extracting keywords based on small sample learning provided by the embodiment of the application are introduced in detail, a specific example is applied to explain the principle and the implementation mode of the application, and the description of the embodiment is only used for helping to understand the method and the core idea of the application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A keyword extraction method based on small sample learning is characterized by comprising the following steps:
acquiring first text data;
inputting the first text data into a BiLSTM model to obtain a text sequence with a preset length;
inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data;
inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data;
judging whether the first keyword list and the second keyword list are the same;
if the first keyword list is different from the second keyword list, updating the BiLSTM model and the BiLSTM-CRF model by using the first text data; and if the first keyword list and the second keyword list are the same, determining the first keyword list of the first text data as an extracted target keyword list.
2. The method for extracting keywords based on small sample learning as claimed in claim 1, wherein the obtaining of the first text data comprises:
acquiring communication conversation voice data;
carrying out voice recognition on the communication conversation voice data to obtain second text data;
the first text data is determined from the second text data.
3. The method for extracting keywords based on small sample learning as claimed in claim 2, wherein the determining the first text data according to the second text data comprises:
judging the dialect type of the second text data according to a preset dialect dictionary;
determining a corresponding dialect Mandarin mapping dictionary according to the dialect type of the second text data;
mapping the second text data to third text data of a Mandarin type according to the dialect Mandarin mapping dictionary;
the first text data is determined from the third text data.
4. The method for extracting keywords based on small sample learning as claimed in claim 3, wherein the determining the first text data according to the third text data comprises:
merging synonyms in the third text data to obtain synonym merged text data;
and inputting the synonym merged text data into a target abstract generation model for dimensionality reduction to obtain first text data.
5. The small sample learning-based keyword extraction method according to claim 4, wherein the target abstract generation model comprises an encoder sub-model and a decoder sub-model, and the network layers of the encoder sub-model comprise a multi-head self-attention mechanism sub-network layer and a fully-connected feedforward sub-network layer; the network layers of the decoder sub-model include a multi-headed self-attention mechanism sub-network layer, an attention sub-network layer, and a fully-connected feed-forward sub-network layer.
6. The method for extracting keywords based on small sample learning as claimed in claim 1, wherein the inputting of the first text data into a BiLSTM model to obtain a text sequence with a preset length comprises:
removing preset modal particles and preset meaningless words from the first text data to obtain removed text data;
inputting the removed text data into a pre-trained character vector model, reading the characters of the removed text data, and acquiring a character vector list;
and inputting the character vector list into the BiLSTM model to obtain a text sequence with a preset length.
7. The method for extracting keywords based on small sample learning as claimed in claim 1, wherein before the acquiring of the first text data, the method comprises:
performing text preprocessing on the training text set to obtain preprocessed text data;
performing preset modal particle removal, feature word extraction, and repair labeling on the preprocessed text data to obtain a target training vector and corresponding target label data;
and training the BiLSTM-CRF model based on the target training vector and the corresponding target label data.
8. A keyword extraction apparatus based on small sample learning is characterized in that the keyword extraction apparatus based on small sample learning includes:
an acquisition unit configured to acquire first text data;
the text acquisition unit is used for inputting the first text data into the BiLSTM model to obtain a text sequence with a preset length;
the first keyword extraction unit is used for inputting the text sequence into a BiLSTM-CRF model to obtain a first keyword list of the first text data;
the second keyword extraction unit is used for inputting the first text data into a preset keyword extraction model for keyword extraction to obtain a second keyword list of the first text data;
a judging unit configured to judge whether the first keyword list and the second keyword list are the same;
a determining unit, configured to update the BiLSTM model and the BiLSTM-CRF model by using the first text data if the first keyword list and the second keyword list are different; and if the first keyword list and the second keyword list are the same, determine the first keyword list of the first text data as an extracted target keyword list.
9. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a memory; and
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the small sample learning based keyword extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which is loaded by a processor to perform the steps of the small sample learning based keyword extraction method of any one of claims 1 to 7.
CN202211459952.0A 2022-11-17 2022-11-17 Keyword extraction method and device based on small sample learning Active CN115827815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211459952.0A CN115827815B (en) 2022-11-17 2022-11-17 Keyword extraction method and device based on small sample learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211459952.0A CN115827815B (en) 2022-11-17 2022-11-17 Keyword extraction method and device based on small sample learning

Publications (2)

Publication Number Publication Date
CN115827815A true CN115827815A (en) 2023-03-21
CN115827815B CN115827815B (en) 2023-12-29

Family

ID=85529876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211459952.0A Active CN115827815B (en) 2022-11-17 2022-11-17 Keyword extraction method and device based on small sample learning

Country Status (1)

Country Link
CN (1) CN115827815B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109829058A (en) * 2019-01-17 2019-05-31 西北大学 A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN111160017A (en) * 2019-12-12 2020-05-15 北京文思海辉金信软件有限公司 Keyword extraction method, phonetics scoring method and phonetics recommendation method
CN113761903A (en) * 2020-06-05 2021-12-07 国家计算机网络与信息安全管理中心 Text screening method for high-volume high-noise spoken short text
CN111737979A (en) * 2020-06-18 2020-10-02 龙马智芯(珠海横琴)科技有限公司 Keyword correction method, device, correction equipment and storage medium for voice text
CN111985214A (en) * 2020-08-19 2020-11-24 四川长虹电器股份有限公司 Human-computer interaction negative emotion analysis method based on bilstm and attention
CN112183099A (en) * 2020-10-09 2021-01-05 上海明略人工智能(集团)有限公司 Named entity identification method and system based on semi-supervised small sample extension
WO2022134575A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Service keyword extraction method, apparatus, and device, and storage medium
CN113268995A (en) * 2021-07-19 2021-08-17 北京邮电大学 Chinese academy keyword extraction method, device and storage medium
CN113536735A (en) * 2021-09-17 2021-10-22 杭州费尔斯通科技有限公司 Text marking method, system and storage medium based on keywords
CN113889115A (en) * 2021-09-29 2022-01-04 深圳市易平方网络科技有限公司 Dialect commentary method based on voice model and related device
CN114091454A (en) * 2021-11-29 2022-02-25 重庆市地理信息和遥感应用中心 Method for extracting place name information and positioning space in internet text
CN114328930A (en) * 2021-12-31 2022-04-12 成都思维世纪科技有限责任公司 Text classification method and system based on entity extraction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAN FENG ET AL.: "A small samples training framework for deep Learning-based automatic information extraction: Case study of construction accident news reports analysis", Advanced Engineering Informatics, pages 1-13 *
JIANFENG DENG ET AL.: "Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification", Computer Speech & Language, pages 1-12 *
叶良攀: "Research on BiLSTM-Based Railway Dispatching Speech Recognition ***", China Master's Theses Full-text Database, Engineering Science and Technology II, pages 033-605 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894092A (en) * 2023-09-11 2023-10-17 中移(苏州)软件技术有限公司 Text processing method, text processing device, electronic equipment and readable storage medium
CN116894092B (en) * 2023-09-11 2024-01-26 中移(苏州)软件技术有限公司 Text processing method, text processing device, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN115827815B (en) 2023-12-29

Similar Documents

Publication Publication Date Title
WO2014033799A1 (en) Word meaning relationship extraction device
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111931477B (en) Text matching method and device, electronic equipment and storage medium
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
WO2024067276A1 (en) Video tag determination method and apparatus, device and medium
CN112699686A (en) Semantic understanding method, device, equipment and medium based on task type dialog system
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
CN115827815B (en) Keyword extraction method and device based on small sample learning
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN117217277A (en) Pre-training method, device, equipment, storage medium and product of language model
Jang et al. A novel density-based clustering method using word embedding features for dialogue intention recognition
CN115329765A (en) Method and device for identifying risks of listed enterprises, electronic equipment and storage medium
CN115017884A (en) Text parallel sentence pair extraction method based on image-text multi-mode gating enhancement
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111723583B (en) Statement processing method, device, equipment and storage medium based on intention role
CN110888983B (en) Positive and negative emotion analysis method, terminal equipment and storage medium
CN112906368A (en) Industry text increment method, related device and computer program product
Ding et al. Event extraction with deep contextualized word representation and multi-attention layer
WO2023087935A1 (en) Coreference resolution method, and training method and apparatus for coreference resolution model
CN110705274A (en) Fusion type word meaning embedding method based on real-time learning
CN115238696A (en) Chinese named entity recognition method, electronic equipment and storage medium
CN115357697A (en) Data processing method, device, terminal equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114780673A (en) Scientific and technological achievement management method and scientific and technological achievement management platform based on field matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant