CN115017870A - Closed-loop script augmentation method and device, computer device and storage medium - Google Patents

Closed-loop script augmentation method and device, computer device and storage medium

Info

Publication number
CN115017870A
Authority
CN
China
Prior art keywords
text
writing
augmentation
keyword
script
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210603422.2A
Other languages
Chinese (zh)
Inventor
于凤英
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210603422.2A priority Critical patent/CN115017870A/en
Publication of CN115017870A publication Critical patent/CN115017870A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/10: Text processing
    • G06F 40/166: Editing, e.g. inserting or deleting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems

Abstract

The invention discloses a closed-loop script augmentation method and device, a computer device, and a storage medium. The method comprises the following steps: acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set; performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, to generate a first augmented script text set for the target intent; extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent; and performing script augmentation again based on the first augmented script text set and the second keyword set, to obtain the script augmentation result for the target intent. By combining label-text-based and keyword-based augmentation, the invention enriches the augmented scripts; by adopting a closed-loop, multi-round augmentation scheme, it increases the volume and diversity of the augmented script data.

Description

Closed-loop script augmentation method and device, computer device and storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular to a closed-loop script augmentation method, a device, a computer device, and a storage medium.
Background
Natural Language Processing (NLP) is a technology that enables machines to communicate interactively using the natural language of human communication. At present, natural language processing models depend on labeled data, and labeling often consumes substantial human resources. In many cases, an online service in a new scenario cannot provide a sufficient amount of labeled data, which causes a cold-start problem when going online. To address this problem, script augmentation systems have been developed: given a script, such a system generates a series of scripts with similar semantics. This augmentation approach has two drawbacks. First, because the augmented scripts are semantically similar to the existing labeled scripts, their diversity is limited, and insufficient diversity of the augmented scripts leads to poor intent recognition in the new scenario. Second, because the augmented scripts have fluency problems, the generated sentences must be filtered manually, which again consumes a large amount of human resources.
Disclosure of Invention
The invention provides a closed-loop script augmentation method, a device, a computer device, and a storage medium, aiming to solve the technical problems of existing script augmentation systems, namely the insufficient diversity of augmented scripts and the need to filter them manually.
To solve these technical problems, the technical scheme adopted by the invention is as follows:
A closed-loop script augmentation method, comprising:
acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set;
performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, to generate a first augmented script text set for the target intent;
extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
and performing script augmentation again based on the first augmented script text set and the second keyword set, to obtain the script augmentation result for the target intent.
The technical scheme adopted by the embodiment of the invention further comprises: the extracting of the first keyword set of the target intent from the script text set comprises:
extracting keywords of the target intent from the script text set with the TextRank algorithm;
wherein extracting keywords of the target intent with the TextRank algorithm specifically comprises: setting a sliding window of length m, regarding all words inside the same window as adjacent nodes, constructing an undirected graph over the words with the co-occurrence counts of word pairs as edge weights, and extracting keywords based on the undirected graph.
The technical scheme adopted by the embodiment of the invention further comprises: the extracting of keywords of the target intent from the script text set with the TextRank algorithm comprises:
breaking each script text into sentences according to its punctuation marks;
segmenting each sentence into words, removing stop words, tagging each word with its part of speech, retaining only words of the specified parts of speech, and generating a candidate keyword set;
constructing an undirected graph G(V, E) over the words based on the candidate keyword set, where V is the node set and E is the edge set;
iteratively computing the Rank value of each node with the PageRank algorithm over the undirected graph;
sorting the Rank values of all nodes in descending order and selecting the top M candidates as the final keywords;
and marking the selected keywords in the script texts, judging whether at least two marked keywords form an adjacent phrase, and if so, merging them into a single keyword, yielding the first keyword set of the target intent.
The technical scheme adopted by the embodiment of the invention further comprises: the mT5 model comprises a script-text-based augmentation model and a keyword-based augmentation model, and the performing of script augmentation with the trained mT5 model based on the script text set and the first keyword set comprises:
inputting the script text set into the trained script-text-based augmentation model to generate augmented scripts semantically similar to the script texts;
and inputting the first keyword set into the trained keyword-based augmentation model to generate augmented scripts matching the target intent.
The technical scheme adopted by the embodiment of the invention further comprises: the training process of the script-text-based augmentation model comprises:
computing the semantic similarity between script texts under the same intent with the semantic similarity model RE2, filtering out script pairs whose semantics differ too much, and generating a training set of "labeled script -> augmented script" pairs; importing the training set into the publicly pre-trained mt5-base for further training, yielding the trained script-text-based augmentation model;
the training process of the keyword-based augmentation model comprises: constructing a training set of "keyword/phrase -> augmented script" pairs, importing it into the publicly pre-trained mt5-base for further training, and outputting the trained keyword-based augmentation model.
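As a concrete illustration of the pair-construction step, the following is a minimal sketch that pairs up same-intent scripts and drops pairs the similarity model scores too low. Here `re2_score` stands in for a trained RE2 scorer; its name, signature, and the 0.6 threshold are assumptions for illustration, not details fixed by the invention:

```python
# Hypothetical sketch: build the "labeled script -> augmented script" training set.
# `re2_score(s1, s2)` is assumed to return a semantic similarity in [0, 1].
from itertools import permutations

def build_pairs(scripts_by_intent, re2_score, min_sim=0.6):
    """Pair up same-intent scripts, dropping pairs whose RE2 similarity
    is too low (i.e., whose semantics differ too much)."""
    pairs = []
    for intent, scripts in scripts_by_intent.items():
        for s1, s2 in permutations(scripts, 2):  # ordered pairs within one intent
            if re2_score(s1, s2) >= min_sim:
                pairs.append({"source": s1, "target": s2})  # sentence1 -> sentence2
    return pairs
```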
The technical scheme adopted by the embodiment of the invention further comprises: before keywords are extracted again from the first augmented script text set, the method further comprises:
evaluating the fluency of all augmented script texts in the first augmented script text set with a BERT+MLM-based evaluation model;
and filtering out augmented script texts whose fluency is below a set threshold.
The technical scheme adopted by the embodiment of the invention further comprises: the evaluating of the fluency of all augmented script texts in the first augmented script text set with the BERT+MLM-based evaluation model comprises:
for each augmented script text in the first augmented script text set, recursively masking each word with a BERT-based MLM model and predicting the masked word; the prediction is deemed correct if it matches the word in the augmented script text and incorrect otherwise;
taking the proportion of correctly predicted words to the total number of words in the augmented script text as a first metric of its fluency;
predicting the positions of non-fluent words in the augmented script text with a BERT+CRF-based error detection model;
taking the proportion of non-fluent words to the total number of words in the augmented script text as a second metric of its fluency;
performing a weighted fusion of the first and second metrics to obtain a threshold for measuring script-text fluency;
and filtering out the augmented script texts in the first augmented script text set whose fluency is below the threshold.
Another technical scheme adopted by the embodiment of the invention is: a closed-loop script augmentation device, comprising:
a first keyword extraction module: for acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set;
a first script augmentation module: for performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, generating a first augmented script text set for the target intent;
a second keyword extraction module: for extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
a second script augmentation module: for performing script augmentation again based on the first augmented script text set and the second keyword set, obtaining the script augmentation result for the target intent.
Another technical scheme adopted by the embodiment of the invention is: a computer device, comprising:
a memory storing executable program code;
a processor coupled to the memory;
wherein the processor calls the executable program code stored in the memory to perform the closed-loop script augmentation method of any of claims 1-7.
Another technical scheme adopted by the embodiment of the invention is: a storage medium storing program instructions executable by a processor to perform the closed-loop script augmentation method described above.
The closed-loop script augmentation method and device, computer device, and storage medium augment scripts by combining label-text-based and keyword-based augmentation, enriching the augmented scripts. Augmented script texts whose fluency falls below a set threshold are filtered out by combining the MLM-based and BERT+CRF-based approaches, which improves the accuracy of fluency assessment and saves the cost of manual script filtering. Meanwhile, the invention adopts a closed-loop, multi-round augmentation scheme, increasing the volume and diversity of the augmented script data.
Drawings
FIG. 1 is a schematic flow chart of a closed-loop script augmentation method according to a first embodiment of the present invention;
FIG. 2 is a schematic flow chart of a closed-loop script augmentation method according to a second embodiment of the present invention;
FIG. 3 is a flowchart of extracting keywords from a script text set with the TextRank algorithm according to an embodiment of the present invention;
FIG. 4 is a flowchart of evaluating the fluency of augmented script texts with a BERT+MLM-based evaluation model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a closed-loop script augmentation device according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a storage medium according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second" and "third" in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise. All directional indicators such as up, down, left, right, front, and rear … … in the embodiments of the present invention are only used to explain the relative position relationship between the components, the movement of the components in a particular posture (as shown in the drawings), and if the particular posture changes, the directional indicator changes accordingly. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
Fig. 1 is a schematic flow chart of a closed-loop script augmentation method according to a first embodiment of the present invention. The closed-loop script augmentation method of the first embodiment comprises the steps of:
S100: acquiring a set of labeled script texts for the target intent, and extracting a first keyword set of the target intent from the script text set;
In this step, keywords are extracted from the script text set with the TextRank algorithm, a graph-based ranking algorithm for keyword extraction. The algorithm is unsupervised and extracts keywords from the co-occurrence information (i.e., semantics) between words in the script texts. Specifically: a sliding window of length m is set (the window length is usually 5 and can be tuned to the application scenario), all words inside the same window are regarded as adjacent nodes, and an undirected graph over the words is constructed in which the co-occurrence count of each word pair serves as the edge weight; keywords are then extracted based on this undirected graph.
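A minimal sketch of this window-based graph construction, using networkx; the window length m=5 and the pre-tokenized input are assumptions taken from the description above:

```python
# Sketch: build the sliding-window co-occurrence graph described in the text.
import networkx as nx

def build_cooccurrence_graph(tokens, m=5):
    """Treat every pair of words that co-occur inside a window of length m
    as adjacent nodes; the edge weight is the number of co-occurrences."""
    g = nx.Graph()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + m, len(tokens))):
            u, v = tokens[i], tokens[j]
            if u == v:
                continue
            w = g[u][v]["weight"] + 1 if g.has_edge(u, v) else 1
            g.add_edge(u, v, weight=w)  # co-occurrence count as edge weight
    return g
```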
S110: performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, to generate a first augmented script text set for the target intent;
In this step, the mT5 model is a pre-trained model obtained by further training the T5 model on the multilingual version of the C4 dataset (mC4). The mT5 model adopts an Encoder-Decoder architecture, and both the Encoder and the Decoder are Transformer structures.
In the embodiment of the invention, script augmentation with the mT5 model comprises augmentation based on the script text set and augmentation based on the first keyword set. For the former, scripts semantically similar to the labeled script texts are generated, with the semantic similarity model RE2 used to measure the similarity; for the latter, scripts matching the target intent are generated from the extracted keyword set.
S120: extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
S130: performing script augmentation again based on the first augmented script text set and the second keyword set, to obtain the script augmentation result for the target intent.
Based on the above, the closed-loop script augmentation method of the first embodiment augments scripts by combining label-text-based and keyword-based augmentation, enriching the augmented scripts, and adopts a closed-loop, multi-round augmentation scheme, increasing the volume and diversity of the augmented script data.
Fig. 2 is a schematic flow chart of a closed-loop script augmentation method according to a second embodiment of the present invention. The closed-loop script augmentation method of the second embodiment comprises the steps of:
S200: acquiring a set of labeled script texts for a target intent;
In this step, the script text set contains different script expressions of the same user intent. For example, for the user intent "buy insurance", the different script expressions include: "I want to buy insurance", "buy a policy", and so on.
S210: extracting keywords from the script text set with the TextRank algorithm to generate a first keyword set of the target intent;
In this step, the TextRank algorithm is a graph-based ranking algorithm for keyword extraction; it is unsupervised and extracts keywords from the co-occurrence information (i.e., semantics) between words in the script texts. Specifically: a sliding window of length m is set (usually 5, tunable to the application scenario), all words inside the same window are regarded as adjacent nodes, and an undirected graph over the words is constructed with the co-occurrence counts of word pairs as edge weights; keywords are then extracted based on this undirected graph. The keyword-scoring formula over the undirected graph is:
$$ WS(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) $$
where WS(V_i) denotes the importance of node V_i: the higher the score, the more likely V_i is a keyword. In(V_i) is the set of nodes with edges into V_i, Out(V_j) is the set of nodes that V_j's edges point to, w_{ji} is the weight of the edge between V_j and V_i, and d is the damping coefficient, typically 0.85.
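The iteration can be sketched directly from this formula; the graph object comes from the previous sketch, and the fixed iteration count is an assumption (the embodiment iterates until convergence):

```python
# Sketch: weighted TextRank update implementing WS(Vi) above.
def textrank_scores(g, d=0.85, iters=30):
    ws = {v: 1.0 / g.number_of_nodes() for v in g}  # Rank initialized to 1/N
    for _ in range(iters):
        new = {}
        for vi in g:
            s = 0.0
            for vj in g.neighbors(vi):  # undirected graph: In(Vi) == Out(Vi)
                out_w = sum(g[vj][vk]["weight"] for vk in g.neighbors(vj))
                s += g[vj][vi]["weight"] / out_w * ws[vj]
            new[vi] = (1 - d) + d * s
        ws = new
    return ws  # higher score = more likely a keyword
```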
Further, referring to fig. 3, which is a flowchart of extracting keywords from a script text set with the TextRank algorithm according to an embodiment of the present invention, the extraction specifically comprises the following steps:
S211: breaking each script text into sentences according to its punctuation marks, obtaining a number of sentences; the punctuation marks include, but are not limited to, "?", "。", "！", etc.;
S212: segmenting each sentence into words, removing stop words, tagging each word with its part of speech, retaining only words of the specified parts of speech (such as nouns and verbs), and generating a candidate keyword set;
S213: constructing an undirected graph G(V, E) over the words based on the candidate keyword set, where V is the node set (i.e., the candidate keyword set) and E is the edge set; an edge between two nodes is created via the sliding window and the co-occurrence relation: an edge exists between two nodes only when they co-occur within a window of length m, with the window sliding continuously from the beginning of the text to its end;
S214: iteratively computing the Rank value of each node with the PageRank algorithm until convergence; each node's Rank value is initialized to 1/N, where N is the number of nodes;
the PageRank algorithm defines a random walk model, namely a first-order Markov chain, on an undirected graph, and describes the behavior of random walkers in randomly accessing various nodes along the graph. Under a certain condition, the probability of accessing each node under the limit condition converges to stable distribution, at the moment, the stable probability value of each node is the Rank value of the node, and the higher the Rank value is, the higher the importance degree of the node is.
S215: performing descending order arrangement on the Rank values of all the nodes, and selecting the first M candidate keywords as final keywords according to an ordering result;
s216, marking the selected keywords in the linguistic and academic text, judging whether at least two keywords exist in the marked linguistic and academic text to form an adjacent phrase, and if so, combining the at least two keywords forming the adjacent phrase into one keyword to obtain a final first keyword set of the target intention;
for example, if two adjacent keywords of "safe and insurance" marked in the dialect text exist, the two adjacent keywords of "safe and insurance" are combined into one keyword of "safe and insurance", which is beneficial to improving the accuracy of subsequent dialect augmentation.
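A short sketch of this merging step; the token list and the Rank-selected keyword set are assumed inputs:

```python
# Sketch: merge runs of adjacent keywords into single keywords.
def merge_adjacent_keywords(tokens, keywords):
    """Scan the segmented script text; consecutive keyword tokens are merged
    into one keyword, e.g. ["平安", "保险"] -> "平安保险"."""
    merged, run = [], []
    for tok in tokens + [None]:              # sentinel flushes the final run
        if tok is not None and tok in keywords:
            run.append(tok)
        else:
            if run:
                merged.append("".join(run))  # Chinese tokens join without spaces
                run = []
    return set(merged)
```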
S220: performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, to generate a first augmented script text set for the target intent;
In this step, the mT5 model is a pre-trained model obtained by further training the T5 model on the multilingual version of the C4 dataset (mC4); it adopts an Encoder-Decoder architecture, and both the Encoder and the Decoder are Transformer structures.
In the embodiment of the invention, script augmentation with the mT5 model comprises augmentation based on the script text set and augmentation based on the first keyword set. For the former, scripts semantically similar to the labeled script texts are generated, with the semantic similarity model RE2 used to measure the similarity; for the latter, scripts matching the target intent are generated from the extracted keyword set.
Based on the above, the training process of the mT5 model in the embodiment of the present invention comprises two parts. The first part trains the script-text-based augmentation model. First, a training set of "labeled script -> augmented script" pairs is constructed: the semantic similarity between script texts under the same intent is computed with the semantic similarity model RE2, script pairs whose semantics differ too much are filtered out, and a training set of "sentence1 -> sentence2" pairs is generated. This training set is then imported into the publicly pre-trained mt5-base for further training, yielding the trained script-text-based augmentation model mt5-base-sentence-model. At augmentation time, it suffices to feed sentence1 into mt5-base-sentence-model to automatically generate sentence2, then feed sentence2 to generate sentence3, and so on, repeatedly producing augmented script texts. For example, with the model input sentence1: "Ping An insurance is too expensive", the automatically generated sentence2 is: "The insurance costs of the Ping An company are very expensive", and sentence3, generated from sentence2, is: "The price of Ping An insurance is expensive".
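A hedged sketch of this inference chain using the Hugging Face transformers classes for mT5; the local checkpoint path "mt5-base-sentence-model" and the sampling settings are assumptions for illustration, not details fixed by the invention:

```python
# Sketch: sentence1 -> sentence2 -> sentence3 augmentation chain with a
# fine-tuned mT5 checkpoint (path assumed).
from transformers import MT5ForConditionalGeneration, T5Tokenizer

tok = T5Tokenizer.from_pretrained("mt5-base-sentence-model")
model = MT5ForConditionalGeneration.from_pretrained("mt5-base-sentence-model")

def augment_chain(sentence, rounds=3):
    chain = []
    for _ in range(rounds):
        ids = tok(sentence, return_tensors="pt").input_ids
        out = model.generate(ids, max_length=64, do_sample=True, top_p=0.9)
        sentence = tok.decode(out[0], skip_special_tokens=True)
        chain.append(sentence)  # sentence2, sentence3, ...
    return chain
```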
The second part trains the keyword-based augmentation model. First, a training set of "keyword/phrase -> augmented script" pairs is constructed: for each script text sentence under the target intent, its keyword set keywords is extracted with TextRank, generating a training set of "keywords -> sentence" pairs. This training set is imported into the publicly pre-trained mt5-base for further training, and the corresponding trained keyword-based augmentation model is output (mt5-).
S230: evaluating the fluency of all augmented script texts in the first augmented script text set with a BERT+MLM-based evaluation model, and filtering out augmented script texts whose fluency is below a set threshold;
S231: for each augmented script text, recursively masking each word with a BERT-based MLM model and predicting the masked word; the prediction is deemed correct if it matches the word in the augmented script text and incorrect otherwise;
S232: taking the proportion of correctly predicted words to the total number of words in the augmented script text as a first metric of its fluency;
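A sketch of this first metric; the checkpoint "bert-base-chinese" and token-level (rather than word-level) masking are assumptions for illustration:

```python
# Sketch: mask each token in turn with a BERT MLM head and count how often
# the model recovers the original token.
import torch
from transformers import BertTokenizer, BertForMaskedLM

btok = BertTokenizer.from_pretrained("bert-base-chinese")
bert_mlm = BertForMaskedLM.from_pretrained("bert-base-chinese").eval()

@torch.no_grad()
def mlm_fluency(text):
    ids = btok(text, return_tensors="pt").input_ids[0]
    hits, total = 0, 0
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = btok.mask_token_id
        logits = bert_mlm(masked.unsqueeze(0)).logits[0, i]
        hits += int(logits.argmax().item() == ids[i].item())
        total += 1
    return hits / max(total, 1)                # first metric, in [0, 1]
```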
S233: predicting the positions of non-fluent words in the augmented script text with a BERT+CRF-based error detection model, marking positions predicted as fluent with O and positions predicted as non-fluent with B;
The BERT+CRF-based error detection model is trained on a "correct/incorrect script -> BIO labeling result" training set as a sequence labeling model over script texts, predicting a label for each word: correct words are labeled O, the beginning of an erroneous span is labeled B, and the inside of an erroneous span is labeled I. Training on this set yields the BERT+CRF-based error detection model.
S234: taking the proportion of words labeled B to the total number of words in the augmented script text as a second metric of its fluency;
S235: performing a weighted fusion of the first and second metrics to obtain a threshold for measuring script-text fluency;
S236: filtering out the augmented script texts in the first augmented script text set whose fluency is below the threshold.
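A sketch of S233-S236 combined; `bio_tags` stands in for the BERT+CRF error detection model (assumed to return one B/I/O tag per word), and the equal fusion weights and mean-based threshold are assumptions, since the patent does not fix them:

```python
# Sketch: fuse the MLM metric and the BIO-based metric, then filter.
def fluency_filter(texts, mlm_fluency, bio_tags, w1=0.5, w2=0.5):
    scored = []
    for t in texts:
        tags = bio_tags(t)                            # e.g. ["O", "B", "I", ...]
        m1 = mlm_fluency(t)                           # share of recovered words
        m2 = tags.count("B") / max(len(tags), 1)      # share of non-fluent words
        scored.append((t, w1 * m1 + w2 * (1 - m2)))   # higher = more fluent
    threshold = sum(s for _, s in scored) / len(scored)  # assumed: mean cut-off
    return [t for t, s in scored if s >= threshold]
```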
Based on the above, the embodiment of the invention filters augmented scripts for fluency with two mechanisms, one MLM-based and one BERT+CRF-based, which improves the accuracy of fluency assessment and saves the cost of manual script filtering.
S240: extracting keywords again from the filtered first augmented script text set with the TextRank algorithm, to obtain a second keyword set of the target intent;
In this step, the extraction of the second keyword set proceeds exactly as for the first keyword set and is not repeated here.
S250: performing script augmentation again with the mT5 model based on the filtered first augmented script text set and the second keyword set, to obtain the final script augmentation result;
In this step, the augmentation proceeds as in S220 and is not repeated here. It can be understood that keyword extraction and script augmentation can be cycled many times in a closed loop according to the application scenario, which increases the number of augmented scripts, enriches their diversity, improves the generalization ability of the model, and improves the accuracy of intent recognition.
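The whole second embodiment can then be sketched as a single closed loop; the component functions correspond to the sketches above, and the round count is an assumed, scenario-dependent parameter:

```python
# Sketch: the closed loop of extract -> augment -> filter, repeated per round.
def closed_loop_augment(seed_scripts, extract_keywords, augment, fluency_filter,
                        rounds=2):
    scripts = list(seed_scripts)
    for _ in range(rounds):
        keywords = extract_keywords(scripts)      # TextRank over the current set
        grown = augment(scripts, keywords)        # mT5: text- and keyword-based
        scripts = fluency_filter(scripts + grown) # drop non-fluent augmentations
    return scripts
```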
Based on the above, the closed-loop script augmentation method of the second embodiment augments scripts by combining label-text-based and keyword-based augmentation, enriching the augmented scripts. Augmented script texts whose fluency falls below the set threshold are filtered out by combining the MLM-based and BERT+CRF-based approaches, improving the accuracy of fluency assessment and saving the cost of manual script filtering. Meanwhile, the closed-loop, multi-round augmentation scheme increases the volume and diversity of the augmented script data.
In an alternative embodiment, the result of the closed-loop script augmentation method may also be uploaded to a blockchain.
Specifically, the corresponding digest information is derived from the result of the closed-loop script augmentation method; in particular, the digest is obtained by hashing the result, for example with the sha256 algorithm. Uploading the digest information to the blockchain ensures its security and fair transparency for the user: the user can download the digest from the blockchain to verify whether the result of the closed-loop script augmentation method has been tampered with. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and cryptographic algorithms. A blockchain is essentially a decentralized database, a chain of data blocks linked by cryptographic methods; each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
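A sketch of the digest computation; sha256 comes from the text, while `upload_to_chain` is a hypothetical placeholder for whatever blockchain client a deployment actually uses:

```python
# Sketch: hash the augmentation result into a digest for the blockchain.
import hashlib
import json

def digest_result(augmented_scripts):
    # Canonical JSON serialization so the same result always hashes identically.
    payload = json.dumps(augmented_scripts, ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# upload_to_chain(digest_result(results))  # hypothetical client call
```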
Fig. 5 is a schematic structural diagram of a closed-loop script augmentation device according to an embodiment of the present invention. The closed-loop script augmentation device 40 of the embodiment comprises:
a first keyword extraction module 41: for acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set. The first keyword extraction module extracts keywords from the script text set with the TextRank algorithm, a graph-based, unsupervised ranking algorithm that extracts keywords from the co-occurrence information (i.e., semantics) between words in the script texts. Specifically: a sliding window of length m is set (usually 5, tunable to the application scenario), all words inside the same window are regarded as adjacent nodes, and an undirected graph over the words is constructed with the co-occurrence counts of word pairs as edge weights; keywords are then extracted based on this undirected graph.
a first script augmentation module 42: for performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, generating a first augmented script text set for the target intent. The mT5 model is a pre-trained model obtained by further training the T5 model on the multilingual version of the C4 dataset (mC4); it adopts an Encoder-Decoder architecture, with both the Encoder and the Decoder being Transformer structures. The augmentation performed by this module comprises augmentation based on the script text set and augmentation based on the first keyword set: for the former, scripts semantically similar to the labeled script texts are generated, with the semantic similarity model RE2 used to measure the similarity; for the latter, scripts matching the target intent are generated from the extracted keyword set.
a second keyword extraction module 43: for extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
a second script augmentation module 44: for performing script augmentation again based on the first augmented script text set and the second keyword set, obtaining the script augmentation result for the target intent.
The closed-loop script augmentation device of the embodiment augments scripts by combining label-text-based and keyword-based augmentation, enriching the augmented scripts, and adopts a closed-loop, multi-round augmentation scheme, increasing the volume and diversity of the augmented script data.
Fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention. The computer device 50 includes:
a memory 51 in which executable program code is stored;
a processor 52 connected to the memory 51;
The processor 52 calls the executable program code stored in the memory 51 to perform the steps of the closed-loop script augmentation method disclosed in the embodiments of the present invention.
The processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip having signal processing capabilities, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a storage medium according to an embodiment of the invention. The storage medium of the embodiment stores a program file 61 capable of implementing all of the methods described above. The program file 61 may be stored in the storage medium in the form of a software product and comprises several instructions that cause a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, as well as terminal devices such as computers, servers, mobile phones, and tablets.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A closed-loop script augmentation method, comprising:
acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set;
performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, to generate a first augmented script text set for the target intent;
extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
and performing script augmentation again based on the first augmented script text set and the second keyword set, to obtain the script augmentation result for the target intent.
2. The closed-loop script augmentation method of claim 1, wherein the extracting of the first keyword set of the target intent from the script text set comprises:
extracting keywords of the target intent from the script text set with the TextRank algorithm;
wherein extracting keywords of the target intent with the TextRank algorithm specifically comprises: setting a sliding window of length m, regarding all words inside the same window as adjacent nodes, constructing an undirected graph over the words with the co-occurrence counts of word pairs as edge weights, and extracting keywords based on the undirected graph.
3. The closed-loop script augmentation method of claim 2, wherein the extracting of keywords of the target intent from the script text set with the TextRank algorithm comprises:
breaking each script text into sentences according to its punctuation marks;
segmenting each sentence into words, removing stop words, tagging each word with its part of speech, retaining only words of the specified parts of speech, and generating a candidate keyword set;
constructing an undirected graph G(V, E) over the words based on the candidate keyword set, where V is the node set and E is the edge set;
iteratively computing the Rank value of each node with the PageRank algorithm over the undirected graph;
sorting the Rank values of all nodes in descending order and selecting the top M candidates as the final keywords;
and marking the selected keywords in the script texts, judging whether at least two marked keywords form an adjacent phrase, and if so, merging them into a single keyword, yielding the first keyword set of the target intent.
4. The closed-loop script augmentation method of any one of claims 1 to 3, wherein the mT5 model comprises a script-text-based augmentation model and a keyword-based augmentation model, and the performing of script augmentation with the trained mT5 model based on the script text set and the first keyword set comprises:
inputting the script text set into the trained script-text-based augmentation model to generate augmented scripts semantically similar to the script texts;
and inputting the first keyword set into the trained keyword-based augmentation model to generate augmented scripts matching the target intent.
5. The closed-loop script augmentation method of claim 4, wherein the training process of the script-text-based augmentation model comprises:
computing the semantic similarity between script texts under the same intent with the semantic similarity model RE2, filtering out script pairs whose semantics differ too much, and generating a training set of "labeled script -> augmented script" pairs; importing the training set into the publicly pre-trained mt5-base for further training, yielding the trained script-text-based augmentation model;
and wherein the training process of the keyword-based augmentation model comprises: constructing a training set of "keyword/phrase -> augmented script" pairs, importing it into the publicly pre-trained mt5-base for further training, and outputting the trained keyword-based augmentation model.
6. The closed-loop script augmentation method of claim 5, wherein before keywords are extracted again from the first augmented script text set, the method further comprises:
evaluating the fluency of all augmented script texts in the first augmented script text set with a BERT+MLM-based evaluation model;
and filtering out augmented script texts whose fluency is below a set threshold.
7. The closed-loop script augmentation method of claim 6, wherein the evaluating of the fluency of all augmented script texts in the first augmented script text set with the BERT+MLM-based evaluation model comprises:
for each augmented script text in the first augmented script text set, recursively masking each word with a BERT-based MLM model and predicting the masked word, the prediction being deemed correct if it matches the word in the augmented script text and incorrect otherwise;
taking the proportion of correctly predicted words to the total number of words in the augmented script text as a first metric of its fluency;
predicting the positions of non-fluent words in the augmented script text with a BERT+CRF-based error detection model;
taking the proportion of non-fluent words to the total number of words in the augmented script text as a second metric of its fluency;
performing a weighted fusion of the first and second metrics to obtain a threshold for measuring script-text fluency;
and filtering out the augmented script texts in the first augmented script text set whose fluency is below the threshold.
8. A closed-loop script augmentation device, comprising:
a first keyword extraction module: for acquiring a set of labeled script texts for a target intent, and extracting a first keyword set of the target intent from the script text set;
a first script augmentation module: for performing script augmentation with a trained mT5 model based on the script text set and the first keyword set, generating a first augmented script text set for the target intent;
a second keyword extraction module: for extracting keywords again from the first augmented script text set to obtain a second keyword set of the target intent;
a second script augmentation module: for performing script augmentation again based on the first augmented script text set and the second keyword set, obtaining the script augmentation result for the target intent.
9. A computer device, characterized in that the computer device comprises:
a memory storing executable program code;
a processor coupled to the memory;
wherein the processor calls the executable program code stored in the memory to perform the closed-loop script augmentation method of any one of claims 1-7.
10. A storage medium having stored thereon program instructions executable by a processor to perform the closed-loop script augmentation method of any one of claims 1 to 7.
CN202210603422.2A 2022-05-30 2022-05-30 Closed-loop script augmentation method and device, computer device and storage medium Withdrawn CN115017870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210603422.2A CN115017870A (en) 2022-05-30 2022-05-30 Closed-loop script augmentation method and device, computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210603422.2A CN115017870A (en) 2022-05-30 2022-05-30 Closed-loop script augmentation method and device, computer device and storage medium

Publications (1)

Publication Number Publication Date
CN115017870A true CN115017870A (en) 2022-09-06

Family

ID=83071075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210603422.2A CN115017870A (en) 2022-05-30 2022-05-30 Closed-loop script augmentation method and device, computer device and storage medium

Country Status (1)

Country Link
CN (1) CN115017870A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110990547A (en) * 2019-11-29 2020-04-10 支付宝(杭州)信息技术有限公司 Phone operation generation method and system
CN111339278A (en) * 2020-02-28 2020-06-26 支付宝(杭州)信息技术有限公司 Method and device for generating training speech generating model and method and device for generating answer speech
US20220165257A1 (en) * 2020-11-20 2022-05-26 Soundhound, Inc. Neural sentence generator for virtual assistants

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116956835A (en) * 2023-09-15 2023-10-27 京华信息科技股份有限公司 Document generation method based on pre-training language model
CN116956835B (en) * 2023-09-15 2024-01-02 京华信息科技股份有限公司 Document generation method based on pre-training language model
CN117934229A (en) * 2024-03-18 2024-04-26 新励成教育科技股份有限公司 Originality excitation-based talent training guiding method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220906