CN116611434A - Data enhancement method, device, equipment and storage medium thereof - Google Patents

Data enhancement method, device, equipment and storage medium thereof

Info

Publication number
CN116611434A
CN116611434A CN202310508062.2A CN202310508062A CN116611434A CN 116611434 A CN116611434 A CN 116611434A CN 202310508062 A CN202310508062 A CN 202310508062A CN 116611434 A CN116611434 A CN 116611434A
Authority
CN
China
Prior art keywords
word
word segmentation
text
scene
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310508062.2A
Other languages
Chinese (zh)
Inventor
袁美璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202310508062.2A
Publication of CN116611434A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application belong to the technical field of data processing and relate to a data enhancement method, apparatus, device, and storage medium. The method comprises: acquiring agent communication voice and converting it into an initial text using ASR (automatic speech recognition) technology; performing scene extraction on the initial text; determining a corresponding target communication topic based on the scene extraction result and a preset topic-scene field association table; performing error correction on the initial text with a preset error correction model to obtain a corrected text; performing word segmentation on the corrected text to obtain segmented words; generating a word vector for each of the segmented words; and replacing the top N ranked segmented words based on a preset synonym dictionary, thereby completing data enhancement of the corrected text. By combining the MacBERT error correction model with the Word2vec word vector generation model, the method corrects the quality of the agent conversation data set while also enriching its diversity, providing solid data support for subsequent business use.

Description

Data enhancement method, device, equipment and storage medium thereof
Technical Field
The present application relates to the field of data enhancement technologies, and in particular, to a data enhancement method, apparatus, device, and storage medium thereof.
Background
Agents play a critical role in insurance marketing, and insurance marketing is the most direct and critical link in bringing in insurance business. Thoroughly studying and making good use of agents' sales scripts can improve marketing conversion and increase company revenue. However, in real dialogue scenarios every agent speaks differently, so collecting scripts at the early stage of research is a large task, and the requirements for scene coverage are high. Scene-based data extraction is a major source of data for a given topic. However, data collection can be affected by the speech-to-text conversion itself and by manually defined, one-sided scene rules, so data enhancement is needed to compensate for scenes with insufficient data.
In particular, in the intelligent insurance and claim settlement services of the insurance business, the lack of dialogue corpus means that user intent cannot be properly understood and accurate feedback cannot be given, which often leads to customer loss. Therefore, how to expand dialogue text content based on existing dialogue text has become a problem to be solved.
Disclosure of Invention
The embodiments of the application aim to provide a data enhancement method, apparatus, device, and storage medium, so that dialogue text content can be expanded based on existing dialogue text.
In order to solve the above technical problem, an embodiment of the application provides a data enhancement method, which adopts the following technical solution:
a data enhancement method, comprising the steps of:
acquiring agent communication voice, and converting the communication voice into an initial text using ASR conversion technology;
performing scene extraction on the initial text according to a preset scene extraction model;
determining a corresponding target communication topic based on the scene extraction result and a preset topic-scene field association table;
performing error correction on the initial text with a preset error correction model, based on the typo table under the target communication topic, to obtain a corrected text;
performing word segmentation on the corrected text based on a Jieba word segmentation model to obtain segmented words;
generating a word vector for each of the segmented words using a Word2vec word vector generation model, and screening out, from the segmented words, the top N segmented words ranked by relevance to the target communication topic, wherein N is a positive integer;
and replacing the top N ranked segmented words based on a preset synonym dictionary, thereby completing data enhancement of the corrected text.
Further, before the step of performing scene extraction on the initial text according to the preset scene extraction model, the method further comprises:
acquiring the extraction keywords preset for each scene in the scene extraction model;
configuring the extraction keywords corresponding to each scene into the scene extraction model as comparison words;
the step of performing scene extraction on the initial text according to the preset scene extraction model specifically comprises:
acquiring the initial text input into the scene extraction model;
identifying all extraction keywords contained in the initial text by search-and-compare matching against the extraction keywords configured in the scene extraction model;
determining all scenes involved in the initial text according to the identified extraction keywords;
after the step of performing scene extraction on the initial text according to the preset scene extraction model, the method further comprises:
outputting all the scenes, together with the extraction keywords corresponding to each scene, as a JSON array, and taking the output JSON array as the scene extraction result.
Further, the step of determining the corresponding target communication topic based on the scene extraction result and the preset topic-scene field association table specifically comprises:
querying the topic-scene field association table for the topic that contains all the scene information in the scene extraction result, and taking that topic as the target communication topic, wherein the topic field and the scene field in the association table are in a one-to-many relationship, i.e. one topic field can correspond to a plurality of scene fields.
Further, before the step of performing error correction on the initial text with the preset error correction model, based on the typo table under the target communication topic, to obtain the corrected text, the method further comprises:
acquiring the typo tables preset for each communication topic;
pre-configuring the typo tables preset for each communication topic into the error correction model as configuration files;
the step of performing error correction on the initial text with the preset error correction model, based on the typo table under the target communication topic, to obtain the corrected text specifically comprises:
acquiring the initial text input into the error correction model;
screening for typos in the initial text using the typo table corresponding to the target communication topic in the error correction model, and marking the typos found in the initial text;
and correcting the typos in the initial text according to the marking result and a preset error correction table to obtain the corrected text.
Further, the step of performing word segmentation on the corrected text based on the Jieba word segmentation model to obtain segmented words specifically comprises:
acquiring the corrected text input into the Jieba word segmentation model;
performing part-of-speech tagging on the corrected text using the Viterbi algorithm built into the Jieba word segmentation model;
and performing word segmentation on the corrected text according to the part-of-speech tagging result to obtain the segmented words.
Further, before the step of generating a word vector for each of the segmented words using the Word2vec word vector generation model, the method further comprises:
acquiring the segmented words input into the Word2vec word vector generation model;
the step of generating a word vector for each of the segmented words using the Word2vec word vector generation model specifically comprises:
counting, according to a statistical algorithm in the Word2vec word vector generation model, the frequency with which each segmented word occurs in the corrected text;
taking the frequency of each segmented word in the corrected text as the word vector of that segmented word;
the step of screening out, from the segmented words, the top N segmented words ranked by relevance to the target communication topic specifically comprises:
sorting the segmented words by their word vectors, and taking the top N segmented words in the word vector ranking as the top N segmented words ranked by relevance to the target communication topic.
Further, the step of replacing the top N ranked segmented words based on the preset synonym dictionary to complete data enhancement of the corrected text specifically comprises:
acquiring, from the corrected text, the text sentences in which the top N ranked segmented words respectively occur;
each time any target segmented word among the top N ranked segmented words is replaced, updating the text sentence in which the target segmented word is located to generate a new text sentence;
and acquiring all the new text sentences and adding them to the corrected text to complete data enhancement of the corrected text.
In order to solve the above technical problem, an embodiment of the application further provides a data enhancement device, which adopts the following technical solution:
a data enhancement device, comprising:
a voice acquisition and conversion module, configured to acquire agent communication voice and convert the communication voice into an initial text using ASR conversion technology;
a scene extraction module, configured to perform scene extraction on the initial text according to a preset scene extraction model;
a communication topic determination module, configured to determine a corresponding target communication topic based on the scene extraction result and a preset topic-scene field association table;
a text error correction processing module, configured to perform error correction on the initial text with a preset error correction model, based on the typo table under the target communication topic, to obtain a corrected text;
a text word segmentation processing module, configured to perform word segmentation on the corrected text based on a Jieba word segmentation model to obtain segmented words;
a word segmentation screening module, configured to generate a word vector for each of the segmented words using a Word2vec word vector generation model, and to screen out, from the segmented words, the top N segmented words ranked by relevance to the target communication topic, wherein N is a positive integer;
and a word replacement module, configured to replace the top N ranked segmented words based on a preset synonym dictionary, thereby completing data enhancement of the corrected text.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
a computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the data enhancement method described above.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
a computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of a data enhancement method as described above.
Compared with the prior art, the embodiment of the application has the following main beneficial effects:
According to the data enhancement method provided by the embodiments of the application, agent communication voice is acquired and converted into an initial text using ASR conversion technology; scene extraction is performed on the initial text according to a preset scene extraction model; a corresponding target communication topic is determined based on the scene extraction result and a preset topic-scene field association table; error correction is performed on the initial text with a preset error correction model, based on the typo table under the target communication topic, to obtain a corrected text; word segmentation is performed on the corrected text based on a Jieba word segmentation model to obtain segmented words; a word vector is generated for each of the segmented words using a Word2vec word vector generation model, and the top N segmented words ranked by relevance to the target communication topic are screened out, wherein N is a positive integer; and the top N ranked segmented words are replaced based on a preset synonym dictionary, completing data enhancement of the corrected text. By combining the MacBERT error correction model with the Word2vec word vector generation model, the method corrects the quality of the agent conversation data set while also enriching its diversity, providing solid data support for subsequent business use.
Drawings
To illustrate the solution of the application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the application; a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a data enhancement method according to the present application;
FIG. 3 is a flow chart of one embodiment of step 202 of FIG. 2;
FIG. 4 is a flow chart of one embodiment of step 204 shown in FIG. 2;
FIG. 5 is a flow chart of one embodiment of step 205 of FIG. 2;
FIG. 6 is a flow chart of one embodiment of step 207 shown in FIG. 2;
FIG. 7 is a schematic diagram of an embodiment of a data enhancement device according to the present application;
FIG. 8 is a schematic diagram of an embodiment of a computer device in accordance with the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "comprising" and "having" and any variations thereof in the description of the application and the claims and the description of the drawings above are intended to cover a non-exclusive inclusion. The terms first, second and the like in the description and in the claims or in the above-described figures, are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the data enhancement method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the data enhancement device is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow chart of one embodiment of a data enhancement method according to the present application is shown. The data enhancement method comprises the following steps:
Step 201, acquiring agent communication voice, and converting the communication voice into an initial text using ASR conversion technology.
ASR speech conversion turns the agent communication voice into text, so that a computer can analyze and process the data directly.
Step 202, performing scene extraction on the initial text according to a preset scene extraction model.
In this embodiment, before executing the step of performing scene extraction on the initial text according to the preset scene extraction model, the method further includes: acquiring extraction keywords preset for each scene in the scene extraction model; and configuring the extracted keywords corresponding to each scene into the scene extraction model by taking the extracted keywords as comparison words.
With continued reference to FIG. 3, FIG. 3 is a flow chart of one embodiment of step 202 shown in FIG. 2, comprising:
step 301, acquiring the initial text input into the scene extraction model;
step 302, identifying all extraction keywords contained in the initial text by search-and-compare matching against the extraction keywords configured in the scene extraction model;
step 303, determining all scenes involved in the initial text according to the identified extraction keywords.
In this embodiment, after executing the step of performing scene extraction on the initial text according to the preset scene extraction model, the method further includes: outputting all scenes and extraction keywords corresponding to all scenes respectively in a JSON array form, and taking the output JSON array as a scene extraction result.
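The keyword-matching procedure of steps 301 to 303 and the JSON array output can be sketched as follows. This is a minimal illustration, not the patent's implementation: the scene names, the keywords, and the `extract_scenes` helper are all hypothetical, and plain substring search stands in for the "search comparison" the text describes.

```python
import json

# Hypothetical scene -> extraction keyword configuration (assumption; the real
# keywords would come from the preset configuration of the scene extraction model).
SCENE_KEYWORDS = {
    "claim_report": ["accident", "report a claim"],
    "renewal": ["renew", "expire"],
}

def extract_scenes(initial_text, scene_keywords=SCENE_KEYWORDS):
    """Return one {"scene", "keywords"} entry per scene whose keywords occur in the text."""
    result = []
    for scene, keywords in scene_keywords.items():
        # Substring search is the assumed "search comparison" here.
        hits = [kw for kw in keywords if kw in initial_text]
        if hits:
            result.append({"scene": scene, "keywords": hits})
    return result

text = "your policy will expire next month, would you like to renew it"
scene_result = extract_scenes(text)
# Serialize the scenes and their matched keywords as a JSON array.
scene_json = json.dumps(scene_result, ensure_ascii=False)
print(scene_json)
```

Keeping the matched keywords alongside each scene makes the result auditable: a reviewer can see which words triggered which scene.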
Step 203, determining a corresponding target communication topic based on the scene extraction result and a preset topic scene field association table.
In this embodiment, the step of determining the corresponding target communication topic based on the scene extraction result and the preset topic scene field association table specifically includes: and searching the topics containing all the scene information in the scene extraction result from the topic scene field association table by adopting a query mode as the target communication topic, wherein the relationship between the topic field and the scene field in the topic scene field association table is one-to-many relationship, i.e. one topic field can correspond to a plurality of scene fields.
Determining the target communication topic identifies the business topic to which the agent communication voice relates.
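A minimal sketch of the one-to-many lookup, assuming the association table is held in memory as a mapping from topic to its set of scenes. The table contents and the `find_target_topics` helper are hypothetical; in practice the table might live in a database and be queried there.

```python
# Hypothetical topic -> scenes association table (one topic, many scenes).
TOPIC_SCENES = {
    "auto_insurance_renewal": {"renewal", "quote"},
    "claims": {"claim_report", "damage_assessment"},
}

def find_target_topics(scene_result, topic_scenes=TOPIC_SCENES):
    """Return the topics whose scene set contains every scene in the extraction result."""
    scenes = {item["scene"] for item in scene_result}
    if not scenes:
        return []
    # A topic qualifies only if all extracted scenes belong to it (subset test).
    return [topic for topic, known in topic_scenes.items() if scenes <= known]

scene_result = [{"scene": "renewal", "keywords": ["renew"]}]
topics = find_target_topics(scene_result)
print(topics)
```

The subset test mirrors the requirement that the chosen topic contain all the scene information in the extraction result; a text that mixes scenes from different topics yields no match in this sketch.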
Step 204, performing error correction on the initial text with a preset error correction model, based on the typo table under the target communication topic, to obtain a corrected text.
In this embodiment, before the step of performing error correction on the initial text with the preset error correction model, based on the typo table under the target communication topic, to obtain the corrected text, the method further comprises: acquiring the typo tables preset for each communication topic; and pre-configuring the typo tables preset for each communication topic into the error correction model as configuration files.
With continued reference to fig. 4, fig. 4 is a flow chart of one embodiment of step 204 shown in fig. 2, comprising:
step 401, acquiring the initial text input into the error correction model;
step 402, screening for typos in the initial text using the typo table corresponding to the target communication topic in the error correction model, and marking the typos found in the initial text;
step 403, correcting the typos in the initial text according to the marking result and a preset error correction table to obtain the corrected text.
In this embodiment, the error correction model is a MacBERT model, whose full name is MLM as correction BERT, and which is used for text error correction.
Correcting the initial text with the error correction model ensures that the subsequent data enhancement is performed on correct text content, which avoids augmenting erroneous text and improves the quality of the text to be enhanced.
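The mark-then-correct flow of steps 401 to 403 can be illustrated with a table lookup alone. The sketch below assumes a single per-topic table that maps each wrong form to its correction. It does not involve a MacBERT model, and the table contents, `mark_typos`, and `correct_text` are hypothetical names chosen for illustration.

```python
# Hypothetical per-topic typo tables: wrong form -> corrected form (assumption).
TYPO_TABLES = {
    "auto_insurance_renewal": {"insurence": "insurance", "premeum": "premium"},
}

def mark_typos(text, typo_table):
    """Return sorted (start, end, wrong_form) spans for every typo occurrence."""
    spans = []
    for wrong in typo_table:
        start = text.find(wrong)
        while start != -1:
            spans.append((start, start + len(wrong), wrong))
            start = text.find(wrong, start + len(wrong))
    return sorted(spans)

def correct_text(text, spans, typo_table):
    """Apply corrections right to left so earlier offsets stay valid."""
    for start, end, wrong in sorted(spans, reverse=True):
        text = text[:start] + typo_table[wrong] + text[end:]
    return text

table = TYPO_TABLES["auto_insurance_renewal"]
initial = "your insurence premeum is due"
spans = mark_typos(initial, table)
corrected = correct_text(initial, spans, table)
print(corrected)
```

Overlapping table entries are not handled here; a real table would need either non-overlapping entries or a longest-match rule.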
Step 205, based on the Jieba word segmentation model, word segmentation processing is performed on the corrected text, and word segmentation words are obtained.
With continued reference to fig. 5, fig. 5 is a flow chart of one embodiment of step 205 shown in fig. 2, comprising:
step 501, acquiring the corrected text input into the Jieba word segmentation model;
step 502, performing part-of-speech tagging on the corrected text using the Viterbi algorithm built into the Jieba word segmentation model;
step 503, performing word segmentation on the corrected text according to the part-of-speech tagging result to obtain the segmented words.
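To make the input and output of the segmentation step concrete without depending on a third-party package, the sketch below uses forward maximum matching against a tiny dictionary. This is a deliberately simpler stand-in and is not how Jieba works (Jieba builds a word graph from a prefix dictionary and uses an HMM with Viterbi decoding for unknown words); the vocabulary and the `forward_max_match` helper are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word
        # is found; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical mini-dictionary: "car insurance", "renewal", "next month", "expire".
vocab = {"车险", "续保", "下个月", "到期"}
tokens = forward_max_match("车险下个月到期", vocab)
print(tokens)
```

Whatever segmenter is used, the output that the later steps consume is the same: an ordered list of segmented words per sentence.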
Step 206, generating a word vector for each of the segmented words using a Word2vec word vector generation model, and screening out, from the segmented words, the top N segmented words ranked by relevance to the target communication topic, wherein N is a positive integer.
In this embodiment, before the step of generating a word vector for each of the segmented words using the Word2vec word vector generation model, the method further comprises: acquiring the segmented words input into the Word2vec word vector generation model.
In this embodiment, the step of generating a word vector for each of the segmented words using the Word2vec word vector generation model specifically comprises: counting, according to a statistical algorithm in the Word2vec word vector generation model, the frequency with which each segmented word occurs in the corrected text; and taking the frequency of each segmented word in the corrected text as the word vector of that segmented word.
In this embodiment, the step of screening out, from the segmented words, the top N segmented words ranked by relevance to the target communication topic specifically comprises: sorting the segmented words by their word vectors, and taking the top N segmented words in the word vector ranking as the top N segmented words ranked by relevance to the target communication topic.
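As described, the ranking reduces to counting how often each segmented word occurs in the corrected text and keeping the N most frequent. The sketch below does exactly that with a plain counter; it does not train or load a Word2vec embedding. The `top_n_tokens` helper and the optional stop-word filter are assumptions added for illustration and are not part of the described method.

```python
from collections import Counter

def top_n_tokens(tokens, n, stopwords=frozenset()):
    """Rank segmented words by frequency and return the n most frequent."""
    # The stop-word filter is an assumption; without it, function words
    # would tend to dominate a pure frequency ranking.
    counts = Counter(t for t in tokens if t not in stopwords)
    return [tok for tok, _ in counts.most_common(n)]

tokens = ["policy", "renew", "policy", "premium", "renew",
          "policy", "the", "the", "the", "the"]
top = top_n_tokens(tokens, 2, stopwords={"the"})
print(top)
```

Without a stop-word list, the same call would rank "the" first, which is one reason a raw frequency ranking may need filtering before it reflects topic relevance.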
Step 207, replacing the top N ranked segmented words based on a preset synonym dictionary, thereby completing data enhancement of the corrected text.
With continued reference to fig. 6, fig. 6 is a flow chart of one embodiment of step 207 of fig. 2, comprising:
step 601, acquiring, from the corrected text, the text sentences in which the top N ranked segmented words respectively occur;
step 602, each time any target segmented word among the top N ranked segmented words is replaced, updating the text sentence in which the target segmented word is located to generate a new text sentence;
step 603, acquiring all the new text sentences and adding them to the corrected text to complete data enhancement of the corrected text.
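Steps 601 to 603 can be sketched as below, assuming each sentence is already a list of segmented words. The synonym dictionary contents and the `augment` helper are hypothetical; each replacement of one target word yields one new sentence, and the new sentences are appended to the original corrected text.

```python
# Hypothetical synonym dictionary: word -> list of synonyms (assumption).
SYNONYMS = {"policy": ["contract"], "renew": ["extend"]}

def augment(sentences, top_words, synonyms=SYNONYMS):
    """Generate one new sentence per (sentence, target word, synonym) replacement."""
    new_sentences = []
    for sent in sentences:
        for word in top_words:
            if word not in sent:
                continue  # step 601: only sentences containing a top-N word
            for syn in synonyms.get(word, []):
                # step 602: replace the target word and emit the updated sentence
                new_sentences.append([syn if tok == word else tok for tok in sent])
    return new_sentences

corrected = [["please", "renew", "your", "policy"], ["thank", "you"]]
# step 603: add all new sentences to the corrected text
augmented = corrected + augment(corrected, ["policy", "renew"])
for sent in augmented:
    print(" ".join(sent))
```

Replacing one word at a time keeps each generated sentence close to the original, which limits semantic drift compared with replacing several words at once.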
According to the application, agent communication voice is acquired and converted into an initial text using ASR conversion technology; scene extraction is performed on the initial text according to a preset scene extraction model; a corresponding target communication topic is determined based on the scene extraction result and a preset topic-scene field association table; error correction is performed on the initial text with a preset error correction model, based on the typo table under the target communication topic, to obtain a corrected text; word segmentation is performed on the corrected text based on a Jieba word segmentation model to obtain segmented words; a word vector is generated for each of the segmented words using a Word2vec word vector generation model, and the top N segmented words ranked by relevance to the target communication topic are screened out, wherein N is a positive integer; and the top N ranked segmented words are replaced based on a preset synonym dictionary, completing data enhancement of the corrected text. By combining the MacBERT error correction model with the Word2vec word vector generation model, the method corrects the quality of the agent conversation data set while also enriching its diversity, providing solid data support for subsequent business use.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
In the embodiments of the application, the MacBERT error correction model and the Word2vec word vector generation model are combined to correct the quality of the agent session data set while enriching the diversity of the data set, providing strong data support for subsequent business use.
With further reference to fig. 7, as an implementation of the method shown in fig. 2 described above, the present application provides an embodiment of a data enhancement device, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic apparatuses.
As shown in fig. 7, the data enhancement device 700 according to the present embodiment includes: a speech acquisition and conversion module 701, a scene extraction module 702, a communication topic determination module 703, a text correction processing module 704, a text word segmentation processing module 705, a word segmentation screening module 706 and a word replacement module 707. Wherein:
the voice acquisition and conversion module 701 is configured to acquire an agent communication voice, and convert the communication voice into an initial text according to an ASR conversion technology;
the scene extraction module 702 is configured to perform scene extraction on the initial text according to a preset scene extraction model;
the communication topic determination module 703 is configured to determine a corresponding target communication topic based on the scene extraction result and a preset topic scene field association table;
the text correction processing module 704 is configured to perform error correction on the initial text using a preset error correction model, based on the wrongly-written-character table under the target communication topic, so as to obtain the corrected text;
the text word segmentation processing module 705 is configured to perform word segmentation processing on the corrected text based on the Jieba word segmentation model to obtain word segmentation words;
the word segmentation screening module 706 is configured to generate a word vector for each of the segmented words using the Word2vec word vector generation model, and to screen out from the segmented words the N segmented words ranking highest in relevance to the target communication topic, where N is a positive integer;
and the word replacement module 707 is configured to replace the top-N ranked segmented words based on a preset synonym dictionary, so as to complete data enhancement of the corrected text.
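How the seven modules chain together can be sketched as below. This is only an illustrative composition under stated assumptions: the class `DataEnhancementDevice`, its field names, and the stub callables in the usage example are hypothetical and stand in for the real ASR, scene extraction, error correction, segmentation, and replacement components.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataEnhancementDevice:
    # Each field stands in for one module of device 700 (names are assumptions).
    transcribe: Callable[[bytes], str]                          # module 701
    extract_scenes: Callable[[str], List[str]]                  # module 702
    determine_topic: Callable[[List[str]], str]                 # module 703
    correct_text: Callable[[str, str], str]                     # module 704
    segment: Callable[[str], List[str]]                         # module 705
    select_top_n: Callable[[List[str], str, int], List[str]]    # module 706
    replace_words: Callable[[str, List[str]], List[str]]        # module 707

    def run(self, audio: bytes, n: int) -> List[str]:
        """Pass the data through the modules in the order 701 to 707."""
        initial_text = self.transcribe(audio)
        scenes = self.extract_scenes(initial_text)
        topic = self.determine_topic(scenes)
        corrected = self.correct_text(initial_text, topic)
        words = self.segment(corrected)
        top_words = self.select_top_n(words, topic, n)
        return self.replace_words(corrected, top_words)


# Usage with trivial stubs in place of the real models.
device = DataEnhancementDevice(
    transcribe=lambda audio: audio.decode("utf-8"),
    extract_scenes=lambda text: ["claim"] if "claim" in text else [],
    determine_topic=lambda scenes: "auto insurance" if scenes else "unknown",
    correct_text=lambda text, topic: text,
    segment=lambda text: text.split(),
    select_top_n=lambda words, topic, n: words[:n],
    replace_words=lambda text, top: [text] + [text.replace(w, w.upper()) for w in top],
)
result = device.run(b"file a claim today", n=1)
```

Passing the modules in as callables keeps each stage replaceable, which matches the description of the device as a set of cooperating modules.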
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by computer-readable instructions stored on a computer-readable storage medium, and that the instructions, when executed, may carry out the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, or a read-only memory (ROM), or it may be a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, these steps are not necessarily performed in order as indicated by the arrows. The steps are not strictly limited in order and may be performed in other orders, unless explicitly stated herein. Moreover, at least some of the steps in the flowcharts of the figures may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order of their execution not necessarily being sequential, but may be performed in turn or alternately with other steps or at least a portion of the other steps or stages.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 8, fig. 8 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 8 comprises a memory 8a, a processor 8b, and a network interface 8c communicatively connected to each other via a system bus. It should be noted that only a computer device 8 having components 8a-8c is shown in the figure, but it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to predetermined or stored instructions, and that its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 8a includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 8a may be an internal storage unit of the computer device 8, such as a hard disk or internal memory of the computer device 8. In other embodiments, the memory 8a may be an external storage device of the computer device 8, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 8. Of course, the memory 8a may also comprise both an internal storage unit of the computer device 8 and an external storage device. In this embodiment, the memory 8a is typically used to store the operating system and the various application software installed on the computer device 8, such as the computer-readable instructions of the data enhancement method. The memory 8a may also be used to temporarily store various types of data that have been output or are to be output.
The processor 8b may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 8b is typically used to control the overall operation of the computer device 8. In this embodiment, the processor 8b is configured to execute computer readable instructions stored in the memory 8a or process data, such as computer readable instructions for executing the data enhancement method.
The network interface 8c may comprise a wireless network interface or a wired network interface, which network interface 8c is typically used to establish a communication connection between the computer device 8 and other electronic devices.
This embodiment provides a computer device, which belongs to the technical field of data enhancement. When the processor executes the computer-readable instructions stored in the memory, the steps of the data enhancement method described above are carried out, so the device achieves the same effect as the method: by combining the MacBERT error correction model with the Word2vec word vector generation model, the quality of the agent session data set is corrected while its diversity is enriched, providing strong data support for subsequent business use.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by a processor to cause the processor to perform the steps of the data enhancement method as described above.
This embodiment provides a computer-readable storage medium, which belongs to the technical field of data enhancement. When the stored computer-readable instructions are executed by a processor, the steps of the data enhancement method described above are carried out, so the storage medium achieves the same effect as the method: by combining the MacBERT error correction model with the Word2vec word vector generation model, the quality of the agent session data set is corrected while its diversity is enriched, providing strong data support for subsequent business use.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present application.
It is apparent that the above-described embodiments are only some, not all, of the embodiments of the present application. The drawings show preferred embodiments of the application but do not limit the scope of the patent. The application may be embodied in many different forms; these embodiments are provided so that the disclosure will be understood thoroughly and completely. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or substitute equivalents for some of their technical features. Any equivalent structure made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the application.

Claims (10)

1. A method of data enhancement, comprising the steps of:
acquiring agent communication speech, and converting the communication speech into an initial text using ASR technology;
performing scene extraction on the initial text according to a preset scene extraction model;
determining a corresponding target communication topic based on the scene extraction result and a preset topic-scene field association table;
performing error correction on the initial text using a preset error correction model, based on the wrongly-written-character table under the target communication topic, to obtain a corrected text;
performing word segmentation on the corrected text based on a Jieba word segmentation model to obtain segmented words;
generating a word vector for each of the segmented words using a Word2vec word vector generation model, and screening out from the segmented words the N segmented words ranking highest in relevance to the target communication topic, wherein N is a positive integer;
and replacing the top-N ranked segmented words based on a preset synonym dictionary to complete data enhancement of the corrected text.
2. The data enhancement method according to claim 1, wherein before performing the step of scene extraction of the initial text according to a preset scene extraction model, the method further comprises:
acquiring extraction keywords preset for each scene in the scene extraction model;
configuring the extracted keywords corresponding to each scene as comparison words into the scene extraction model;
the step of extracting the scene from the initial text according to a preset scene extraction model specifically includes:
acquiring the initial text input into the scene extraction model;
identifying all extraction keywords contained in the initial text by adopting a search comparison mode according to the extraction keywords configured in the scene extraction model;
determining all scenes involved in the initial text according to all the extracted keywords;
after executing the step of extracting the scene from the initial text according to the preset scene extraction model, the method further includes:
outputting all scenes and extraction keywords corresponding to all scenes respectively in a JSON array form, and taking the output JSON array as a scene extraction result.
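The keyword-comparison scene extraction of claim 2 can be illustrated with a short sketch. The scene names, keywords, JSON field names, and the function `extract_scenes` are assumptions chosen for the example; the claim does not specify them.

```python
import json
from typing import Dict, List

# Hypothetical comparison words configured for each scene.
SCENE_KEYWORDS: Dict[str, List[str]] = {
    "claim_report": ["报案", "出险"],
    "policy_renewal": ["续保", "到期"],
}


def extract_scenes(initial_text: str, scene_keywords: Dict[str, List[str]]) -> str:
    """Search the text for configured keywords and return the hits as a JSON array."""
    result = []
    for scene, keywords in scene_keywords.items():
        # Search-and-compare: keep the keywords that literally occur in the text.
        hits = [keyword for keyword in keywords if keyword in initial_text]
        if hits:
            result.append({"scene": scene, "keywords": hits})
    # ensure_ascii=False keeps the Chinese keywords readable in the output.
    return json.dumps(result, ensure_ascii=False)


scene_json = extract_scenes("您好，我的车昨天出险了，想报案", SCENE_KEYWORDS)
```

Each element of the array pairs one scene with the keywords that triggered it, so a later step can read both the scenes and the evidence for them.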
3. The method for enhancing data according to claim 1, wherein the step of determining the corresponding target communication topic based on the scene extraction result and a preset topic scene field association table specifically includes:
searching, by query, the topic-scene field association table for the topic that contains all the scene information in the scene extraction result, and taking that topic as the target communication topic, wherein the relationship between the topic field and the scene field in the topic-scene field association table is one-to-many, i.e., one topic field can correspond to a plurality of scene fields.
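The one-to-many lookup of claim 3 can be sketched as a subset test. The in-memory dictionary `TOPIC_SCENES`, its contents, and the function `find_target_topic` are hypothetical; a real association table would more likely live in a database and be queried there.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical association table: one topic field maps to many scene fields.
TOPIC_SCENES: Dict[str, Set[str]] = {
    "auto_claims": {"claim_report", "damage_assessment", "payout"},
    "policy_service": {"policy_renewal", "policy_change"},
}


def find_target_topic(
    scenes: Iterable[str], topic_scenes: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the first topic whose scene set contains every extracted scene."""
    wanted = set(scenes)
    if not wanted:
        return None
    for topic, scene_set in topic_scenes.items():
        if wanted <= scene_set:  # all extracted scenes belong to this topic
            return topic
    return None


topic = find_target_topic(["claim_report", "payout"], TOPIC_SCENES)
```

Returning `None` when no single topic covers all scenes is a choice made for this sketch; the claim does not say how that case is handled.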
4. The data enhancement method according to claim 1 or 3, wherein, before the step of performing error correction on the initial text using a preset error correction model, based on the wrongly-written-character table under the target communication topic, to obtain a corrected text, the method further comprises:
obtaining the wrongly-written-character tables preset for each communication topic;
pre-configuring the wrongly-written-character tables preset for all communication topics into the error correction model as configuration files;
the step of performing error correction on the initial text using a preset error correction model, based on the wrongly-written-character table under the target communication topic, to obtain a corrected text specifically comprises:
acquiring the initial text input into the error correction model;
screening the wrongly written characters in the initial text using the wrongly-written-character table corresponding to the target communication topic in the error correction model, and marking the wrongly written characters in the initial text;
and correcting the wrongly written characters in the initial text according to the marking result and a preset error correction table, to obtain the corrected text.
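The mark-then-correct flow of claim 4 can be illustrated with plain table lookups. This sketch deliberately does not use MacBERT; it only shows the two table-driven steps, and the table contents and the functions `mark_typos` and `apply_corrections` are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical per-topic wrongly-written-character table and correction table.
TYPO_TABLES: Dict[str, List[str]] = {"auto_claims": ["里赔", "车显"]}
CORRECTION_TABLE: Dict[str, str] = {"里赔": "理赔", "车显": "车险"}


def mark_typos(text: str, typos: List[str]) -> List[Tuple[int, str]]:
    """Screen the text and mark each typo with its start offset."""
    marks: List[Tuple[int, str]] = []
    for typo in typos:
        start = text.find(typo)
        while start != -1:
            marks.append((start, typo))
            start = text.find(typo, start + len(typo))
    return sorted(marks)


def apply_corrections(
    text: str, marks: List[Tuple[int, str]], correction_table: Dict[str, str]
) -> str:
    """Replace every marked typo that has an entry in the correction table."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, typo in sorted(marks, reverse=True):
        fix = correction_table.get(typo)
        if fix is not None:
            text = text[:start] + fix + text[start + len(typo):]
    return text


initial_text = "我想办车显里赔"
marks = mark_typos(initial_text, TYPO_TABLES["auto_claims"])
corrected_text = apply_corrections(initial_text, marks, CORRECTION_TABLE)
```

Keeping the marking result separate from the correction step mirrors the claim, where the marks and the error correction table are two distinct inputs to the final correction.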
5. The data enhancement method according to claim 4, wherein the step of performing word segmentation processing on the corrected text based on the Jieba word segmentation model to obtain word segmentation words specifically includes:
acquiring the corrected text input into the Jieba word segmentation model;
performing part-of-speech tagging on the corrected text through a Viterbi algorithm built in the Jieba word segmentation model;
and performing word segmentation processing on the corrected text according to the part-of-speech tagging result to obtain word segmentation words.
6. The data enhancement method of claim 5, wherein prior to performing said step of generating a Word vector for each of said segmented words using a Word2vec Word vector generation model, said method further comprises:
acquiring the Word segmentation words input into the Word2vec Word vector generation model;
the step of generating a word vector for each of the segmented words using a Word2vec word vector generation model specifically comprises:
according to a statistical algorithm in the Word2vec Word vector generation model, counting the frequency of each Word in the Word segmentation words in the corrected text;
the frequency of each word in the word segmentation words in the corrected text is used as a word vector of the corresponding word segmentation word;
the step of screening out the word segmentation words with the top N positions of the relevance rank of the target communication theme from the word segmentation words specifically comprises the following steps:
and sorting the segmented words according to the word vector corresponding to each word, and taking the N segmented words ranked highest by word vector as the N segmented words ranking highest in relevance to the target communication topic.
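As claim 6 describes it, the frequency of each segmented word in the corrected text serves as its ranking value. A minimal sketch of that frequency-based top-N selection follows; the function name `top_n_by_frequency` and the sample words are assumptions, and the sketch uses a plain counter rather than a trained Word2vec model.

```python
from collections import Counter
from typing import List


def top_n_by_frequency(words: List[str], n: int) -> List[str]:
    """Rank segmented words by their frequency and return the top N."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    counts = Counter(words)
    # most_common sorts by count, highest first.
    return [word for word, _ in counts.most_common(n)]


segmented = ["理赔", "车险", "理赔", "报案", "车险", "理赔"]
top_words = top_n_by_frequency(segmented, n=2)
```

In practice, common function words would probably need to be filtered out before counting, otherwise they would dominate the ranking; that filtering is not part of the claim and is omitted here.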
7. The data enhancement method according to claim 6, wherein the step of replacing the top-N ranked segmented words based on a preset synonym dictionary to complete data enhancement of the corrected text specifically comprises:
obtaining, for each of the top-N ranked segmented words, the text sentence in the corrected text that contains it;
each time any target segmented word among the top-N ranked segmented words is replaced, updating the text sentence containing the target segmented word to generate a new text sentence;
and acquiring all new text sentences and adding them to the corrected text to complete data enhancement of the corrected text.
8. A data enhancement device, comprising:
a voice acquisition and conversion module, configured to acquire agent communication speech and convert the communication speech into an initial text using ASR technology;
a scene extraction module, configured to perform scene extraction on the initial text according to a preset scene extraction model;
a communication topic determination module, configured to determine a corresponding target communication topic based on the scene extraction result and a preset topic-scene field association table;
a text error correction processing module, configured to perform error correction on the initial text using a preset error correction model, based on the wrongly-written-character table under the target communication topic, to obtain a corrected text;
a text word segmentation processing module, configured to perform word segmentation on the corrected text based on a Jieba word segmentation model to obtain segmented words;
a word segmentation screening module, configured to generate a word vector for each of the segmented words using a Word2vec word vector generation model, and to screen out from the segmented words the N segmented words ranking highest in relevance to the target communication topic, wherein N is a positive integer;
and a word replacement module, configured to replace the top-N ranked segmented words based on a preset synonym dictionary to complete data enhancement of the corrected text.
9. A computer device comprising a memory having stored therein computer readable instructions which when executed by a processor implement the steps of the data enhancement method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the data enhancement method according to any of claims 1 to 7.
CN202310508062.2A 2023-05-06 2023-05-06 Data enhancement method, device, equipment and storage medium thereof Pending CN116611434A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310508062.2A CN116611434A (en) 2023-05-06 2023-05-06 Data enhancement method, device, equipment and storage medium thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310508062.2A CN116611434A (en) 2023-05-06 2023-05-06 Data enhancement method, device, equipment and storage medium thereof

Publications (1)

Publication Number Publication Date
CN116611434A true CN116611434A (en) 2023-08-18

Family

ID=87681033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310508062.2A Pending CN116611434A (en) 2023-05-06 2023-05-06 Data enhancement method, device, equipment and storage medium thereof

Country Status (1)

Country Link
CN (1) CN116611434A (en)

Similar Documents

Publication Publication Date Title
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
US11238050B2 (en) Method and apparatus for determining response for user input data, and medium
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN113722438B (en) Sentence vector generation method and device based on sentence vector model and computer equipment
CN111694937A (en) Interviewing method and device based on artificial intelligence, computer equipment and storage medium
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN112084752B (en) Sentence marking method, device, equipment and storage medium based on natural language
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN116796857A (en) LLM model training method, device, equipment and storage medium thereof
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN112949320B (en) Sequence labeling method, device, equipment and medium based on conditional random field
CN112182157B (en) Training method of online sequence labeling model, online labeling method and related equipment
CN112199954B (en) Disease entity matching method and device based on voice semantics and computer equipment
CN117034230A (en) Data verification method, device, equipment and storage medium thereof
CN116881446A (en) Semantic classification method, device, equipment and storage medium thereof
CN117234505A (en) Interactive page generation method, device, equipment and storage medium thereof
CN116796730A (en) Text error correction method, device, equipment and storage medium based on artificial intelligence
CN116701593A (en) Chinese question-answering model training method based on GraphQL and related equipment thereof
CN115169370B (en) Corpus data enhancement method and device, computer equipment and medium
CN115756692A (en) Method for automatically combining and displaying pages based on style attributes and related equipment thereof
CN116611434A (en) Data enhancement method, device, equipment and storage medium thereof
CN113569741A (en) Answer generation method and device for image test questions, electronic equipment and readable medium
CN113807148B (en) Text recognition matching method and device and terminal equipment
CN113343668B (en) Method and device for solving selected questions, electronic equipment and readable storage medium
CN116701512A (en) Inter-server data call acceleration method, inter-server data call acceleration device, inter-server data call acceleration equipment and storage medium of inter-server data call acceleration equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination