CN114861650A - Method and device for cleaning noise data, storage medium and electronic equipment - Google Patents


Info

Publication number: CN114861650A (application CN202210382774.XA); granted and published as CN114861650B
Authority: CN (China)
Prior art keywords: preset, noise data, vector, probability matrix, data
Legal status: Granted; active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: 虞天
Original and current assignee: Dazhu Hangzhou Technology Co., Ltd. (the listed assignees may be inaccurate)
Application filed by Dazhu Hangzhou Technology Co., Ltd.

Classifications

    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G06F 17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 17/18 — Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F 40/216 — Parsing using statistical methods
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a method and an apparatus for cleaning noise data, a storage medium, and an electronic device. The method comprises the following steps: acquiring all texts contained in an entity recognition project; generating a training data set from the whole text; converting the training data set into a preset vector, where the preset vector is a character vector or a word vector; calling a preset model corresponding to the entity recognition project according to the preset vector; generating a prediction probability matrix from the preset vector and the preset model; generating a noise data index from the prediction probability matrix and a preset framework; determining the noise data according to the prediction probability matrix, a first preset probability threshold, and a second preset probability threshold; and cleaning the noise data in the whole text according to the noise data index to generate purified data. The method improves the accuracy of the prediction probability matrix and thereby the accuracy of noise-data recognition and cleaning.

Description

Method and device for cleaning noise data, storage medium and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for cleaning noise data, a storage medium, and an electronic device.
Background
At present, the confident-learning open-source framework cleanlab inherits the scikit-learn application programming interface and supports calling traditional machine learning models, but it does not support the sequence models commonly used for entity recognition. A developer must therefore write a calling module for the sequence model; otherwise the traditional machine learning model is used directly.
Disclosure of Invention
In view of this, the present application provides a method and an apparatus for cleaning noise data, a storage medium, and an electronic device, which improve the accuracy of the prediction probability matrix and thereby the accuracy of noise-data recognition and cleaning.
According to an aspect of the present application, there is provided a method of cleaning noise data, the method comprising: acquiring all texts contained in an entity recognition project; generating a training data set from the whole text; converting the training data set into a preset vector, where the preset vector is a character vector or a word vector; calling a preset model corresponding to the entity recognition project according to the preset vector; generating a prediction probability matrix from the preset vector and the preset model; generating a noise data index from the prediction probability matrix and a preset framework; determining the noise data according to the prediction probability matrix, a first preset probability threshold, and a second preset probability threshold; and cleaning the noise data in the whole text according to the noise data index to generate purified data.
Optionally, the step of calling a preset model corresponding to the entity recognition project according to the preset vector specifically includes: calling a sequence model corresponding to the entity recognition project when the preset vector is a character vector; and calling a regression model corresponding to the entity recognition project when the preset vector is a word vector.
Optionally, the step of generating the prediction probability matrix according to the preset vector and the preset model specifically includes: when the preset vector is a character vector, importing the character vector into the sequence model and generating the prediction probability matrix of the whole text by cross validation; and when the preset vector is a word vector, importing the word vector into the regression model and generating the prediction probability matrix of the whole text by cross validation.
Optionally, the step of generating the noise data index according to the prediction probability matrix and the preset framework specifically includes: acquiring the data labels of the entity recognition project; and inputting the prediction probability matrix and the data labels into the preset framework to generate the noise data index.
Optionally, the step of determining the noise data according to the prediction probability matrix, the first preset probability threshold and the second preset probability threshold specifically includes: determining a confidence joint distribution matrix according to the prediction probability matrix; determining a noise joint probability matrix according to the confidence joint distribution matrix; generating a cleaning rule of the noise data according to the first preset probability threshold and the second preset probability threshold; and determining noise data in the prediction probability matrix according to the noise joint probability matrix and the cleaning rule.
Optionally, after the noise data in the whole text is cleaned according to the noise data index and the purified data is generated, the method further includes: training the regression model with the purified data as training samples.
Optionally, the sequence model comprises at least one of: a BiLSTM-CRF model and a BERT model.
According to another aspect of the present application, there is provided an apparatus for cleaning noise data, the apparatus including: an obtaining module, configured to obtain all texts contained in an entity recognition project; a first generating module, configured to generate a training data set from the whole text; a conversion module, configured to convert the training data set into a preset vector, where the preset vector is a character vector or a word vector; a calling module, configured to call, according to the preset vector, a preset model corresponding to the preset vector in the entity recognition project; a second generating module, configured to generate a prediction probability matrix from the preset vector and the preset model; a third generating module, configured to generate a noise data index from the prediction probability matrix and a preset framework; a determining module, configured to determine the noise data according to the prediction probability matrix, a first preset probability threshold, and a second preset probability threshold; and a fourth generating module, configured to clean the noise data in the whole text according to the noise data index and generate purified data.
According to yet another aspect of the present application, a storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the above-mentioned method of cleaning noise data.
According to still another aspect of the present application, there is provided an electronic device, including a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the above-mentioned noise data cleaning method when executing the computer program.
By means of this technical solution, the entity data set is stripped from the whole text and converted into word vectors, the prediction probability matrix is calculated by cross validation plus a regression model, and the entity noise data is then cleaned, so that the sequence problem is converted into a classification problem for processing entity noise data. Alternatively, the whole text is converted into character vectors, the prediction probability matrix is calculated by cross validation plus a sequence model, and the entity noise data is then cleaned, so that the entity noise data is processed by solving the sequence problem directly. This addresses the prior-art problem that a directly-called traditional machine learning model fits sequence text poorly, improves the accuracy of the prediction probability matrix, and thereby improves the accuracy of noise-data recognition and cleaning.
The foregoing description is only an overview of the technical solutions of the present application, and the present application can be implemented according to the content of the description in order to make the technical means of the present application more clearly understood, and the following detailed description of the present application is given in order to make the above and other objects, features, and advantages of the present application more clearly understandable.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart illustrating a method for cleaning noise data according to an embodiment of the present disclosure;
FIG. 2 is a block diagram illustrating a method for cleaning noise data according to an embodiment of the present disclosure;
FIG. 3 shows a block diagram of an apparatus for cleaning noise data according to an embodiment of the present application.
Detailed Description
The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
In this embodiment, a method for cleaning noise data is provided. As shown in FIG. 1, the method includes:
Step 101, acquiring all texts contained in an entity recognition project;
Step 102, generating a training data set according to the whole text;
Step 103, converting the training data set into a preset vector;
Step 104, calling a preset model corresponding to the preset vector in the entity recognition project according to the preset vector;
Step 105, generating a prediction probability matrix according to the preset vector and the preset model;
Step 106, generating a noise data index according to the prediction probability matrix and a preset framework;
Step 107, determining the noise data according to the prediction probability matrix, a first preset probability threshold, and a second preset probability threshold;
Step 108, cleaning the noise data in the whole text according to the noise data index to generate purified data.
The embodiment of the application provides a method for cleaning noise data. Specifically, after all texts in an entity recognition project are obtained, a training data set is generated from them. If the whole text is retained and used to generate the training data set, X in the generated data set is a character and Y is the label type of that character; if instead the entity characters are extracted from the whole text and the training data set is generated with entities as the minimum granularity, X in the generated data set is an entity and Y is the label type of that entity.
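The two dataset layouts just described can be sketched as follows. This is an illustrative helper, not code from the patent: the function name and the BIO tagging scheme are assumptions. It contrasts the character-level layout (X = character, Y = its tag) with the entity-level layout (X = entity string, Y = entity type).

```python
# Hypothetical helper: build both training-set layouts from one BIO-tagged sentence.
def build_datasets(chars, tags):
    """chars: list of characters; tags: parallel BIO tags such as 'B-PER'."""
    char_level = list(zip(chars, tags))  # character-level: X = character, Y = tag
    entity_level = []                    # entity-level: X = entity, Y = entity type
    entity, etype = "", None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if entity:
                entity_level.append((entity, etype))
            entity, etype = ch, tag[2:]
        elif tag.startswith("I-") and entity:
            entity += ch
        else:  # an 'O' tag closes any open entity span
            if entity:
                entity_level.append((entity, etype))
            entity, etype = "", None
    if entity:
        entity_level.append((entity, etype))
    return char_level, entity_level

chars = list("张三去北京")
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
char_ds, ent_ds = build_datasets(chars, tags)
print(ent_ds)  # [('张三', 'PER'), ('北京', 'LOC')]
```

The character-level set feeds the sequence-model route; the entity-level set feeds the classification (regression-model) route.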
Further, the accuracy of the confident-learning framework's result depends on the accuracy of the input prediction probability matrix. It should be noted that the prediction probability matrix is the probability distribution of the whole text over all label types, and its accuracy depends on the choice of the learning model. Therefore, when the whole text is retained and the training data set is generated from it, the training data set is converted into character vectors by a word-embedding layer; the sequence model used in the entity recognition project is then called, and the prediction probability matrix is generated from the character vectors and the sequence model. When instead the entity texts are extracted and the training data set is generated from them, the training data set is converted into word vectors by a word-embedding layer; the regression model used in the entity recognition project is then called, and the prediction probability matrix is generated from the word vectors and the regression model. By calling the sequence model used in the entity recognition project, the fit to sequence text is improved, and the accuracy of the generated prediction probability matrix is further improved.
The sequence model may be a BiLSTM-CRF model or a fine-tuned BERT model. The regression model may be an LSTM model or a BERT model.
Further, the prediction probability matrix of the whole text is calculated with the preset vector as training samples, by cross validation plus the preset model. The CL framework, i.e. the preset framework, is then called, the prediction probability matrix is input, and the noise data index is generated. The noise data is identified in the prediction probability matrix using the first and second preset probability thresholds.
Further, when the preset vector is a character vector, the noise data is mapped back to the whole-text data set according to the noise data index, and the entity label of the noise data is modified to the preset label, for example "O", so as to clean the noise data in the whole text and finally generate the cleaned, purified data. When the preset vector is a word vector, the noise data index is mapped from the entity data set back to the whole text, and the entity label of the noise data is likewise modified to the preset label, for example "O", to generate the purified data.
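The relabeling step described above is mechanically simple; a minimal sketch follows (the helper name is hypothetical), assuming a character-level tag sequence, a list of noise indices produced by the framework, and the preset label "O":

```python
# Hypothetical helper: overwrite the tags at the noise-data indices with the preset label.
def clean_by_index(tags, noise_indices, preset_label="O"):
    cleaned = list(tags)  # copy so the original annotation is untouched
    for i in noise_indices:
        cleaned[i] = preset_label
    return cleaned

tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(clean_by_index(tags, [3, 4]))  # the suspected-noise LOC span becomes 'O'
```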
According to this method of cleaning noise data, the entity data set is stripped from the whole text and converted into word vectors, the prediction probability matrix is calculated by cross validation plus a regression model, and the entity noise data is then cleaned, so that the sequence problem is converted into a classification problem for processing entity noise data. Alternatively, the whole text is converted into character vectors, the prediction probability matrix is calculated by cross validation plus a sequence model, and the entity noise data is then cleaned, so that the entity noise data is processed by solving the sequence problem directly. This solves the prior-art problem that a directly-called traditional machine learning model fits sequence text poorly, improves the accuracy of the prediction probability matrix, and thereby improves the accuracy of noise-data recognition and cleaning.
In the embodiment of the present application, further, the step of calling a preset model corresponding to the entity recognition project according to the preset vector specifically includes: calling a sequence model corresponding to the entity recognition project when the preset vector is a character vector; and calling a regression model corresponding to the entity recognition project when the preset vector is a word vector.
In this technical solution, the accuracy of the confident-learning framework's result often depends on the accuracy of the input prediction probability matrix, which in turn depends on the choice of the learning model. Therefore, when the whole text is retained and the training data set is generated from it, the training data set is converted into character vectors by a word-embedding layer; the sequence model used in the entity recognition project is then called, and the prediction probability matrix is generated from the character vectors and the sequence model. Further, when the entity texts are extracted and the training data set is generated from them, the training data set is converted into word vectors by a word-embedding layer; the regression model used in the entity recognition project is then called, and the prediction probability matrix is generated from the word vectors and the regression model.
In the embodiment of the present application, further, the step of generating the prediction probability matrix according to the preset vector and the preset model specifically includes: when the preset vector is a character vector, importing the character vector into the sequence model and generating the prediction probability matrix of the whole text by cross validation; and when the preset vector is a word vector, importing the word vector into the regression model and generating the prediction probability matrix of the whole text by cross validation.
In this technical solution, after the whole-text data is converted into character vectors, the sequence model used in the entity recognition project is called, the character vectors are used as training data, and the prediction probability matrix is calculated by cross validation plus the sequence model. Further, after the entity-text data is converted into word vectors, the regression model used in the entity recognition project is called, the word vectors are used as training data, and the prediction probability matrix is calculated by cross validation plus the regression model.
Specifically, the training data is divided into n equal parts; a model is trained on the first n-1 parts and then called to predict the probability distribution of the nth part over all label categories. The held-out part is then rotated, each part in turn serving as the prediction data while the remaining n-1 parts serve as training data, until the probability distribution of the entire training data over all label categories, i.e. the prediction probability matrix, is obtained.
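The n-fold rotation just described is what scikit-learn's `cross_val_predict` provides out of the box: every row of the returned matrix is an out-of-fold probability distribution for that sample. The sketch below uses synthetic data and a logistic-regression stand-in (an assumption for illustration; the patent pairs cross validation with its own sequence or regression model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class data standing in for the vectorized training set.
X, y = make_classification(n_samples=100, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# 5-fold cross validation: each sample's probabilities come from a model
# that never saw that sample during training.
pred_probs = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                               cv=5, method="predict_proba")
print(pred_probs.shape)  # (100, 3): one probability row per training sample
```

Each row sums to 1, which is exactly the "probability distribution over all label categories" the confident-learning framework expects as input.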
In the embodiment of the present application, further, the step of generating the noise data index according to the prediction probability matrix and the preset framework specifically includes: acquiring the data labels of the entity recognition project; and importing the prediction probability matrix and the data labels into the preset framework to generate the noise data index.
In this technical solution, after the noise data is screened out of the prediction probability matrix, its positions in the whole text must still be determined. The preset framework, i.e. the CL framework, is therefore called, and the prediction probability matrix and the data labels are input into it to generate the noise data index, i.e. an index of the noise data's positions in the whole text. Using this index to generate the purified data later effectively improves the efficiency of cleaning the noise data.
In the embodiment of the present application, further, the step of determining the noise data according to the prediction probability matrix, the first preset probability threshold and the second preset probability threshold specifically includes: determining a confidence joint distribution matrix according to the prediction probability matrix; determining a noise joint probability matrix according to the confidence joint distribution matrix; generating a cleaning rule of the noise data according to the first preset probability threshold and the second preset probability threshold; and determining noise data in the prediction probability matrix according to the noise joint probability matrix and the cleaning rule.
In this technical solution, the confidence joint distribution matrix is calculated from the prediction probability matrix, and the noise joint probability matrix is approximately estimated from the normalized confidence joint distribution matrix. A cleaning rule for the noise data is then generated from the intersection of the first and second preset probability thresholds, and the noise data is screened out of the prediction probability matrix according to the noise joint probability matrix and the cleaning rule.
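A minimal numpy sketch of the confidence joint distribution matrix described above, following the confident-learning construction: the per-class threshold is the mean predicted probability of samples carrying that given label, and a sample is counted into cell [given label, confidently predicted label]. Function and variable names are assumptions for illustration, not the patent's notation:

```python
import numpy as np

def confident_joint(pred_probs, labels):
    """Count samples into [given_label, predicted_label] cells using
    per-class self-confidence thresholds (confident-learning style)."""
    n, k = pred_probs.shape
    labels = np.asarray(labels)
    # threshold for class j: mean P(j) over samples whose given label is j
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])
    cj = np.zeros((k, k), dtype=int)
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if above.size:  # count toward the most confident above-threshold class
            j = above[np.argmax(pred_probs[i, above])]
            cj[labels[i], j] += 1
    return cj

probs = np.array([[0.9, 0.1], [0.95, 0.05], [0.2, 0.8], [0.3, 0.7]])
labels = [0, 1, 1, 1]  # sample 1 looks mislabeled: the model is confident it is class 0
cj = confident_joint(probs, labels)
print(cj)  # off-diagonal cell cj[1, 0] flags sample 1 as suspected noise
```

Normalizing `cj` to sum to 1 gives the approximate noise joint probability matrix the paragraph above refers to.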
It should be noted that the first preset probability threshold is the threshold of PBC (prune by class), which corresponds to the lowest confidence, where the lowest confidence can be understood as the probability that a piece of data belongs to its given label: if this probability is below the first preset probability threshold, the data is unlikely to belong to that label. Further, the second preset probability threshold is the threshold of PBNR (prune by noise rate), which corresponds to the normalized margin, where the normalized margin can be understood as the difference between the probability that the piece of data is labeled correctly and the probability that it is labeled incorrectly: if it is below the second preset probability threshold, the data is likely to carry a wrong label and unlikely to carry the correct one. In the prior art, the accuracy of the confident-learning framework's result also depends on the framework's noise-data cleaning rule. By taking the lowest confidence as the first preset probability threshold, taking the normalized margin as the second preset probability threshold, and determining the cleaning rule as the intersection of the two thresholds, the present application effectively improves the accuracy of the cleaning rule, thereby improving the accuracy of screening noise data in the prediction probability matrix, stabilizing the quality of the sequence data, and further improving the accuracy of the entity model's extraction results.
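The intersection rule above can be sketched in a few lines of numpy. Here "lowest confidence" is read as the probability a sample assigns to its given label, and "normalized margin" as that probability minus the best competing label's probability; the threshold values and function name are illustrative assumptions:

```python
import numpy as np

def noise_mask(pred_probs, labels, conf_thresh, margin_thresh):
    """Flag a sample as noise only if it falls below BOTH thresholds
    (intersection of the prune-by-class and prune-by-noise-rate rules)."""
    labels = np.asarray(labels)
    n = len(labels)
    self_conf = pred_probs[np.arange(n), labels]   # P(given label) per sample
    others = pred_probs.copy()
    others[np.arange(n), labels] = -np.inf         # mask out the given label
    margin = self_conf - others.max(axis=1)        # normalized margin
    return (self_conf < conf_thresh) & (margin < margin_thresh)

probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.55, 0.45]])
labels = [0, 0, 0]
mask = noise_mask(probs, labels, conf_thresh=0.5, margin_thresh=0.0)
print(mask)  # only the second sample fails both tests and is flagged
```

Taking the intersection rather than the union keeps samples that fail only one test, which is the conservatism the paragraph above credits for the improved cleaning accuracy.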
In the embodiment of the present application, further, after the noise data in the whole text is cleaned according to the noise data index and the purified data is generated, the method further includes: training the regression model with the purified data as training samples.
In this technical solution, in an entity recognition project, the noise data in the whole text is cleaned by confident learning to obtain purified data, and the regression model used in the entity recognition project is then trained with the purified data as training samples, so as to improve the recognition accuracy of the regression model.
In a practical application scenario, reading the core confident-learning paper and the cleanlab open-source code shows that the accuracy of the confident-learning framework's result depends on two factors: the accuracy of the input prediction probability matrix, and the dirty-data cleaning rule selected within the framework. The prediction probability matrix is the probability distribution of the whole data over all label types; it is calculated by cross validation plus a regression model, and its accuracy depends on the choice of that model. In the prior art, the confident-learning open-source framework cleanlab inherits the scikit-learn API (application programming interface) and supports calling traditional machine learning models, but not the sequence models commonly used for entity recognition; a developer must write a calling module for the sequence model, or else the traditional machine learning model is used directly. Such a model fits sequence text poorly, the resulting prediction probability matrix has low accuracy, and the dirty-data tool ultimately finds dirty data with low accuracy. To address these problems, the present application proposes to process entity noise data either by converting the sequence problem into a classification problem or by solving the sequence problem directly.
Specifically, as shown in FIG. 2, the way of converting the sequence problem into a classification problem to process entity noise data is as follows. After the noisy sample data of the entity recognition project, i.e. the whole text, is obtained, the entity texts carrying B/I labels are extracted from the sample data from the angle of converting the sequence problem into a classification problem, a training data set is generated with entities as the minimum granularity, the training data is converted into word vectors, and a traditional machine-learning scikit-learn regression model is called to calculate the prediction probability matrix. The prediction probability matrix is calculated by cross validation plus the regression model, with five-fold cross validation by default. The CL framework is then called, and the prediction probability matrix and the data labels are input into it to generate the noise data index. A counting step follows: the confidence joint matrix is calculated, the noise-data probability matrix of the sample data is approximately estimated from the regularized confidence joint matrix, the lowest confidence is taken as the first preset probability threshold and the normalized margin as the second preset probability threshold, the intersection of the two thresholds is taken to generate the cleaning rule, and the cleaning rule is used to screen the noise data in the prediction probability matrix. Finally, the noise data index is mapped from the entity data set back to the whole text, the entity label of the noise data is modified to "O", and the purified data is generated.
Further, the way of processing entity noise data by solving the sequence problem directly is as follows. After the sample data of the entity recognition project, i.e. the whole text, is obtained, the whole text is retained to generate a training data set from the viewpoint of solving the sequence problem, where X of the training data set is a character and Y is the label type of the character. The training data is then converted into character vectors, and the sequence model used in the entity recognition project, such as a BiLSTM-CRF model or a fine-tuned BERT model, is called to calculate the prediction probability matrix; the sequence model must be consistent with the entity model of the entity recognition project. The prediction probability matrix is calculated by cross validation plus the sequence model, with five-fold cross validation by default. The CL framework is then called, and the prediction probability matrix and the data labels are input into it to generate the noise data index. A counting step follows: the confidence joint matrix is calculated, the noise-data probability matrix of the sample data is approximately estimated from the regularized confidence joint matrix, the lowest confidence is taken as the first preset probability threshold and the normalized margin as the second preset probability threshold, the intersection of the two thresholds is taken to generate the cleaning rule, and the cleaning rule is used to screen the noise data in the prediction probability matrix. Finally, the entity label of the noise data in the whole text is modified to "O" according to the noise data index, and the purified data is generated.
Based on the confident learning theory, the method and the device of the present application clean the noise data in the named entity recognition task; by removing more of the noise data, the data quality can be greatly improved, and the recognition accuracy of the trained model is further improved.
Further, as a specific implementation of the above method for cleaning noise data, an embodiment of the present application provides an apparatus 300 for cleaning noise data. As shown in fig. 3, the apparatus 300 for cleaning noise data includes:
an obtaining module 301, configured to obtain the whole text contained in the entity recognition project;
a first generating module 302, configured to generate a training data set according to the whole text;
a conversion module 303, configured to convert the training data set into a preset vector, where the preset vector comprises a word vector or a character vector;
a calling module 304, configured to call, according to the preset vector, a preset model corresponding to the preset vector in the entity recognition project;
a second generating module 305, configured to generate a prediction probability matrix according to the preset vector and the preset model;
a third generating module 306, configured to generate a noise data index according to the prediction probability matrix and a preset framework;
a determining module 307, configured to determine noise data according to the prediction probability matrix, a first preset probability threshold, and a second preset probability threshold;
and a fourth generating module 308, configured to clean the noise data in the whole text according to the noise data index to generate purified data.
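Wired together, the eight modules form a linear pipeline. A minimal sketch with stub callables (the application does not fix a concrete vectorizer, model, or framework, so all three stand-ins below are assumptions for illustration) might look like:

```python
def clean_noise_data(texts, labels, vectorize, predict_proba, find_noise_index):
    """Chain modules 301-308: obtain -> vectorize -> predict -> index -> clean."""
    vectors = vectorize(texts)                          # conversion module 303
    pred_probs = predict_proba(vectors)                 # calling + second generating modules
    noise_index = find_noise_index(pred_probs, labels)  # third generating + determining modules
    # Fourth generating module: overwrite noisy entity labels with "O".
    return ["O" if i in noise_index else tag for i, tag in enumerate(labels)]

# Toy stand-ins for demonstration only.
cleaned = clean_noise_data(
    texts=["a", "b", "c"],
    labels=["B-LOC", "O", "B-PER"],
    vectorize=lambda t: t,                                      # identity vectorizer
    predict_proba=lambda v: [[0.2, 0.8], [0.9, 0.1], [0.95, 0.05]],
    find_noise_index=lambda p, y: {0},                          # pretend CL flags sample 0
)
print(cleaned)
```

Each stage corresponds to one module of apparatus 300, which is why the apparatus description reads as a straight-line data flow from the obtaining module to the fourth generating module.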
It should be noted that other corresponding descriptions of the functional modules of the apparatus for cleaning noise data provided in this embodiment of the present application may refer to the corresponding descriptions in fig. 1, and are not repeated here.
Based on the method shown in fig. 1, correspondingly, the embodiment of the present application further provides a storage medium, on which a computer program is stored, and the program, when executed by a processor, implements the method for cleaning the noise data shown in fig. 1.
Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, a hardware product, or a combination of software and hardware. The software product may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes several instructions for enabling an electronic device (which may be a personal computer, a server, a network device, etc.) to execute the methods of the various implementation scenarios of the present application.
Based on the method shown in fig. 1 and the embodiment of the cleaning apparatus for noise data shown in fig. 3, in order to achieve the above object, the embodiment of the present application further provides an electronic device, which may be a personal computer, a server, a network device, and the like, and includes a storage medium and a processor; a storage medium for storing a computer program; a processor for executing a computer program to implement the above-described method of cleaning noise data as shown in fig. 1.
Optionally, the electronic device may further include a user interface, a network interface, a camera, radio frequency (RF) circuitry, sensors, audio circuitry, a WI-FI module, and the like. The user interface may include a display screen and an input unit such as a keyboard, and the optional user interface may also include a USB interface, a card reader interface, etc. The network interface may optionally include a standard wired interface, a wireless interface (e.g., a Bluetooth interface or a WI-FI interface), etc.
It will be understood by those skilled in the art that the present embodiment provides an electronic device structure that is not limiting of the electronic device, and may include more or fewer components, or some components in combination, or a different arrangement of components.
The storage medium may further include an operating system and a network communication module. The operating system is a program that manages and maintains the hardware and software resources of the electronic device, supporting the operation of the information processing program as well as other software and/or programs. The network communication module is used for realizing communication among the components within the storage medium, as well as communication with other hardware and software in the physical device.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus a necessary general hardware platform, and can also be implemented by hardware.
Those skilled in the art will appreciate that the drawings are merely schematic representations of preferred embodiments and that the elements or acts in the drawings are not necessarily required to practice the present application. Those skilled in the art will appreciate that elements of a device in an implementation scenario may be distributed in the device in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The units of the implementation scenario may be combined into one unit, or may be further split into a plurality of sub-units.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios. The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (10)

1. A method of cleaning noise data, the method comprising:
acquiring the whole text contained in the entity recognition project;
generating a training data set according to the whole text;
converting the training data set into a preset vector, wherein the preset vector comprises a word vector or a character vector;
calling a preset model corresponding to the preset vector in the entity recognition project according to the preset vector;
generating a prediction probability matrix according to the preset vector and the preset model;
generating a noise data index according to the prediction probability matrix and a preset framework;
determining noise data according to the prediction probability matrix, a first preset probability threshold and a second preset probability threshold;
and cleaning the noise data in the whole text according to the noise data index to generate purified data.
2. The method according to claim 1, wherein the step of calling the preset model corresponding to the preset vector in the entity recognition project according to the preset vector specifically comprises:
calling a sequence model in the entity recognition project under the condition that the preset vector is the character vector;
and calling a regression model in the entity recognition project under the condition that the preset vector is the word vector.
3. The method according to claim 1, wherein the step of generating a prediction probability matrix according to the predetermined vector and the predetermined model specifically comprises:
under the condition that the preset vector is the character vector, importing the character vector into the sequence model, and generating the prediction probability matrix of the whole text in a cross validation mode;
and under the condition that the preset vector is the word vector, importing the word vector into the regression model, and generating the prediction probability matrix of the whole text in a cross validation mode.
4. The method according to any one of claims 1 to 3, wherein the step of generating the noise data index according to the prediction probability matrix and a preset framework specifically comprises:
acquiring a data label of the entity recognition project;
and inputting the prediction probability matrix and the data label into the preset framework to generate the noise data index.
5. The method according to any one of claims 1 to 3, wherein the step of determining noise data according to the prediction probability matrix, a first preset probability threshold and a second preset probability threshold specifically comprises:
determining a confidence joint distribution matrix according to the prediction probability matrix;
determining a noise joint probability matrix according to the confidence joint distribution matrix;
generating a cleaning rule of the noise data according to the first preset probability threshold and the second preset probability threshold;
and determining the noise data in the prediction probability matrix according to the noise joint probability matrix and the cleaning rule.
6. The method of any one of claims 1 to 3, wherein after the cleaning of the noise data in the whole text according to the noise data index to generate purified data, the method further comprises:
and training the regression model by taking the purified data as a training sample.
7. The method according to any one of claims 1 to 3,
the sequence model includes at least one of: the BiLSTM-CRF model and the BERT model.
8. An apparatus for cleaning noise data, the apparatus comprising:
the obtaining module is used for obtaining the whole text contained in the entity recognition project;
the first generating module is used for generating a training data set according to the whole text;
the conversion module is used for converting the training data set into a preset vector, wherein the preset vector comprises a word vector or a character vector;
the calling module is used for calling, according to the preset vector, a preset model corresponding to the preset vector in the entity recognition project;
the second generating module is used for generating a prediction probability matrix according to the preset vector and the preset model;
the third generating module is used for generating a noise data index according to the prediction probability matrix and a preset framework;
the determining module is used for determining noise data according to the prediction probability matrix, a first preset probability threshold and a second preset probability threshold;
and the fourth generating module is used for cleaning the noise data in the whole text according to the noise data index to generate purified data.
9. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method of any of claims 1 to 7.
10. An electronic device comprising a storage medium, a processor and a computer program stored on the storage medium and executable on the processor, wherein the processor implements the method of any one of claims 1 to 7 when executing the computer program.
CN202210382774.XA 2022-04-13 2022-04-13 Noise data cleaning method and device, storage medium and electronic equipment Active CN114861650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210382774.XA CN114861650B (en) 2022-04-13 2022-04-13 Noise data cleaning method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN114861650A true CN114861650A (en) 2022-08-05
CN114861650B CN114861650B (en) 2024-04-26

Family

ID=82630894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210382774.XA Active CN114861650B (en) 2022-04-13 2022-04-13 Noise data cleaning method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN114861650B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030204312A1 (en) * 2002-04-30 2003-10-30 Exxonmobil Upstream Research Company Method for analyzing spatially-varying noise in seismic data using markov chains
WO2021051560A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Text classification method and apparatus, electronic device, and computer non-volatile readable storage medium
US20210249019A1 (en) * 2018-08-29 2021-08-12 Shenzhen Zhuiyi Technology Co., Ltd. Speech recognition method, system and storage medium
CN113571052A (en) * 2021-07-22 2021-10-29 湖北亿咖通科技有限公司 Noise extraction and instruction identification method and electronic equipment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
H. JAIR ESCALANTE: "Cleaning Training-Datasets with Noise-Aware Algorithms", 2006 Seventh Mexican International Conference on Computer Science, 11 December 2006 (2006-12-11) *
YU MENGCHI: "Noisy Label Relabeling Method", Computer Science, 8 April 2020 (2020-04-08) *
WU JUN; CHENG YAO; HAO HAN; ELIYAR EZIZ; LIU FEIXUE; SU YIPO: "Research on Chinese Technical Term Extraction Based on a BERT-Embedded BiLSTM-CRF Model", Journal of the China Society for Scientific and Technical Information, no. 04, 24 April 2020 (2020-04-24) *
ZHAI XUEMIN; LIU YUAN; LIU BO; BI RONGRONG: "Improved Intelligent XML Data Cleaning Strategy", Computer Engineering, no. 04, 20 February 2009 (2009-02-20) *
HU YONGJUN; JIANG JIAXIN; CHANG HUIYOU: "Chinese Short Text Classification Based on LDA High-Frequency Word Expansion", New Technology of Library and Information Service, no. 06, 25 June 2013 (2013-06-25) *


Similar Documents

Publication Publication Date Title
CN109242013B (en) Data labeling method and device, electronic equipment and storage medium
CN110363084A (en) A kind of class state detection method, device, storage medium and electronics
CN110222330B (en) Semantic recognition method and device, storage medium and computer equipment
CN110427627A (en) Task processing method and device based on semantic expressiveness model
CN111209478A (en) Task pushing method and device, storage medium and electronic equipment
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN111783873A (en) Incremental naive Bayes model-based user portrait method and device
CN110990627A (en) Knowledge graph construction method and device, electronic equipment and medium
CN114492599A (en) Medical image preprocessing method and device based on Fourier domain self-adaptation
CN112084752A (en) Statement marking method, device, equipment and storage medium based on natural language
CN117709435B (en) Training method of large language model, code generation method, device and storage medium
CN111190973A (en) Method, device, equipment and storage medium for classifying statement forms
CN113761925B (en) Named entity identification method, device and equipment based on noise perception mechanism
CN110781849A (en) Image processing method, device, equipment and storage medium
CN113780365A (en) Sample generation method and device
CN111859933A (en) Training method, recognition method, device and equipment of Malay recognition model
CN113656669B (en) Label updating method and device
CN114861650B (en) Noise data cleaning method and device, storage medium and electronic equipment
CN115794054A (en) Code generation method and device, storage medium and computer equipment
CN113688232B (en) Method and device for classifying bid-inviting text, storage medium and terminal
JP2020077054A (en) Selection device and selection method
CN115357720A (en) Multi-task news classification method and device based on BERT
CN113673214A (en) Information list alignment method and device, storage medium and electronic equipment
CN112861519A (en) Medical text error correction method, device and storage medium
CN110827261A (en) Image quality detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant