CN111309855A - Text information processing method and system - Google Patents

Text information processing method and system

Info

Publication number
CN111309855A
Authority
CN
China
Prior art keywords
vocabulary
text
approved
examined
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911345064.4A
Other languages
Chinese (zh)
Inventor
沙彩霞
张军杰
马广腾
曹晶晶
陈晨
张润
唐珩祥
余汉珍
徐国磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN201911345064.4A
Publication of CN111309855A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text information processing method and system. The method comprises: performing word segmentation on a text to be approved to obtain a vocabulary set comprising a plurality of words; extracting the features of each word in the vocabulary set to obtain a vocabulary feature set; inputting the vocabulary feature set into a preset classification model for word classification, and determining whether the text to be approved contains sensitive words; if it contains sensitive words, outputting text information indicating that the text to be approved has failed approval; and if it does not contain sensitive words, outputting text information indicating that the text to be approved has passed approval. In this scheme, a pre-trained classification model classifies the words of the text to be approved and determines whether the text contains sensitive words, and text information indicating whether the text has passed approval is output according to that determination. No manual approval is needed, which saves labor and approval cost, increases approval speed, and improves approval efficiency.

Description

Text information processing method and system
Technical Field
The invention relates to the technical field of data processing, in particular to a text information processing method and system.
Background
With the development of the internet, various group-building applications are emerging in the software application market. When an article about group building is to be published in such an application, the article must first be approved, and only approved articles can be published.
At present, group-building articles are approved manually by an approver, who screens out the articles that meet the publication requirements. However, as more and more people use group-building applications, a large number of group-building articles need to be reviewed. Manual approval consumes a great deal of labor and time, so the approval cost is high, the approval speed is slow, and the approval efficiency is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and a system for processing text information, so as to solve the problems of high approval cost, slow approval speed, low approval efficiency, and the like in the existing manual approval manner.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
A first aspect of the embodiments of the present invention discloses a text information processing method, the method comprising:
performing word segmentation on a text to be approved to obtain a vocabulary set comprising a plurality of words;
extracting the features of each word in the vocabulary set to obtain a vocabulary feature set;
inputting the vocabulary feature set into a preset classification model for word classification, and determining whether the text to be approved contains sensitive words, wherein the classification model is obtained in advance by training a neural network model on sample data, and the sample data comprises a screening sample vocabulary set and a sensitive sample vocabulary set;
if the text to be approved contains sensitive words, outputting text information indicating that the text to be approved has failed approval;
and if the text to be approved does not contain sensitive words, outputting text information indicating that the text to be approved has passed approval.
Preferably, the process of training the classification model includes:
extracting the features of each screening sample word in the screening sample vocabulary set to obtain a forward feature set;
extracting the features of each sensitive sample word in the sensitive sample vocabulary set to obtain a reverse feature set;
and inputting the forward feature set and the reverse feature set into a preset neural network model, and training the neural network model until it converges to obtain the classification model.
Preferably, inputting the vocabulary feature set into a preset classification model for word classification and determining whether the text to be approved contains sensitive words includes:
inputting the vocabulary feature set into the preset classification model for word classification, and determining the word category of each word, wherein the word category indicates whether the word is a sensitive word;
determining the number of sensitive words in the text to be approved based on the word category of each word;
if the number of sensitive words is greater than or equal to a threshold, determining that the text to be approved contains sensitive words;
and if the number of sensitive words is less than the threshold, determining that the text to be approved does not contain sensitive words.
Preferably, performing word segmentation on the text to be approved to obtain a vocabulary set comprising a plurality of words includes:
performing word segmentation on the text to be approved to obtain a plurality of first words;
for each first word, performing part-of-speech tagging and weight setting on the first word to obtain a second word;
determining a vocabulary set that includes all of the second words.
Preferably, extracting the features of each word in the vocabulary set to obtain a vocabulary feature set includes:
performing word vector conversion on each second word to obtain the word vector corresponding to each second word;
performing dimensionality reduction on the word vector corresponding to each second word to obtain the features of each second word;
and determining the vocabulary feature set according to the features of each second word.
A second aspect of the embodiments of the present invention discloses a text information processing system, the system comprising:
a word segmentation unit, configured to perform word segmentation on a text to be approved to obtain a vocabulary set comprising a plurality of words;
an extraction unit, configured to extract the features of each word in the vocabulary set to obtain a vocabulary feature set;
a classification unit, configured to input the vocabulary feature set into a preset classification model for word classification and determine whether the text to be approved contains sensitive words, wherein the classification model is obtained in advance by training a neural network model on sample data, and the sample data comprises a screening sample vocabulary set and a sensitive sample vocabulary set;
and an output unit, configured to output text information indicating that the text to be approved has failed approval if the text to be approved contains sensitive words, and to output text information indicating that the text to be approved has passed approval if the text to be approved does not contain sensitive words.
Preferably, the classification unit includes:
a first extraction module, configured to extract the features of each screening sample word in the screening sample vocabulary set to obtain a forward feature set;
a second extraction module, configured to extract the features of each sensitive sample word in the sensitive sample vocabulary set to obtain a reverse feature set;
and a training module, configured to input the forward feature set and the reverse feature set into a preset neural network model, and to train the neural network model until it converges to obtain the classification model.
Preferably, the classification unit includes:
a classification module, configured to input the vocabulary feature set into the preset classification model for word classification and determine the word category of each word, wherein the word category indicates whether the word is a sensitive word;
a first determining module, configured to determine the number of sensitive words in the text to be approved based on the word category of each word;
and a second determining module, configured to determine that the text to be approved contains sensitive words if the number of sensitive words is greater than or equal to a threshold, and to determine that the text to be approved does not contain sensitive words if the number of sensitive words is less than the threshold.
Preferably, the word segmentation unit includes:
a word segmentation module, configured to perform word segmentation on the text to be approved to obtain a plurality of first words;
a setting module, configured to perform part-of-speech tagging and weight setting on each first word to obtain a second word;
and a determining module, configured to determine a vocabulary set containing all of the second words.
Preferably, the extraction unit includes:
a conversion module, configured to perform word vector conversion on each second word to obtain the word vector corresponding to each second word;
a dimension reduction module, configured to perform dimensionality reduction on the word vector corresponding to each second word to obtain the features of each second word;
and a determining module, configured to determine the vocabulary feature set according to the features of each second word.
Based on the text information processing method and system provided by the embodiments of the present invention, the method comprises: performing word segmentation on a text to be approved to obtain a vocabulary set comprising a plurality of words; extracting the features of each word in the vocabulary set to obtain a vocabulary feature set; and inputting the vocabulary feature set into a preset classification model for word classification to determine whether the text to be approved contains sensitive words. In this scheme, a pre-trained classification model classifies the words of the text to be approved and determines whether the text contains sensitive words. If it contains sensitive words, text information indicating that the text to be approved has failed approval is output; if it does not, text information indicating that the text to be approved has passed approval is output. No manual approval is needed, which saves labor and approval cost, increases approval speed, and improves approval efficiency.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of a text information processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart of obtaining a vocabulary feature set according to an embodiment of the present invention;
Fig. 3 is a flowchart of training a classification model according to an embodiment of the present invention;
Fig. 4 is another flowchart of a text information processing method according to an embodiment of the present invention;
Fig. 5 is a block diagram of a text information processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As described in the background, group-building articles are currently approved manually to screen out the articles that meet the publication requirements. As the number of group-building articles grows, a large number of articles need to be reviewed. Manual approval consumes a great deal of labor and time, so the approval cost is high, the approval speed is slow, and the approval efficiency is low.
Therefore, the embodiments of the present invention provide a text information processing method and system that use a pre-trained classification model to classify the words of a text to be approved and determine whether the text contains sensitive words. Text information indicating whether the text to be approved has passed approval is output according to that determination, without manual approval, so as to save approval cost and improve approval speed and approval efficiency.
Referring to fig. 1, a flowchart of a processing method for text information according to an embodiment of the present invention is shown, where the processing method includes the following steps:
Step S101: Perform word segmentation on the text to be approved to obtain a vocabulary set containing a plurality of words.
In implementing step S101, a word segmentation component, for example the Jieba word segmentation component, segments the text to be approved to obtain a plurality of first words. Part-of-speech tagging and weight setting are then performed on each first word to obtain a second word, and a vocabulary set containing all of the second words is determined.
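As an illustrative, non-limiting sketch of step S101, the following Python fragment shows one way the first words could be tagged and weighted to form second words. The regular-expression splitter is only a stand-in for a real segmentation component such as Jieba, and the names `SecondWord`, `POS_LOOKUP`, `segment`, and `build_vocabulary_set` are hypothetical. The text does not say how weights are set, so relative term frequency is used here purely as an assumption.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical part-of-speech lookup; a real segmenter would supply the tags.
POS_LOOKUP = {"team": "n", "meeting": "n", "held": "v", "a": "x"}


@dataclass(frozen=True)
class SecondWord:
    """A segmented word annotated with a part-of-speech tag and a weight."""
    text: str
    pos: str
    weight: float


def segment(text: str) -> list[str]:
    """Stand-in segmenter: lower-cases and splits on non-word characters."""
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def build_vocabulary_set(text: str) -> list[SecondWord]:
    """Segment the text, then tag and weight each first word."""
    first_words = segment(text)
    counts = Counter(first_words)
    total = len(first_words)
    # Assumption: the weight of a word is its relative frequency in the text.
    return [
        SecondWord(word, POS_LOOKUP.get(word, "unk"), counts[word] / total)
        for word in first_words
    ]


vocabulary_set = build_vocabulary_set("The team held a team meeting")
```

Keeping one entry per occurrence, rather than per distinct word, preserves word positions, which may be convenient if sensitive words are later marked in the text.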
Step S102: Extract the features of each word in the vocabulary set to obtain a vocabulary feature set.
In implementing step S102, the feature corresponding to each word in the vocabulary set is extracted, and a vocabulary feature set containing the feature of each word is obtained.
Step S103: Input the vocabulary feature set into a preset classification model for word classification, and determine whether the text to be approved contains sensitive words.
It should be noted that a large number of group-building articles are collected in advance through a preset channel, for example from the "Reviving Number One" article library. Word segmentation is performed on each collected article in the manner described in step S101. The screening vocabulary sets obtained from all of the segmented articles are used to construct a screening sample vocabulary set (screening sample lexicon); that is, the screening sample words in the screening sample vocabulary set are words related to group building.
Similarly, a large number of sensitive words are collected through a preset channel, and a sensitive sample vocabulary set (sensitive sample lexicon) is constructed from the collected sensitive words.
A pre-constructed neural network model is then trained on sample data comprising the screening sample vocabulary set and the sensitive sample vocabulary set until the neural network model converges, yielding the classification model.
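The construction of the two sample vocabulary sets described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the toy data are hypothetical, whitespace splitting stands in for the real segmenter, and the rule that a word found in both lexicons is labeled sensitive is an assumption that the text does not state.

```python
from typing import Callable, Iterable


def build_screening_lexicon(
    articles: Iterable[str], segment: Callable[[str], list[str]]
) -> set[str]:
    """Collect every word produced by segmenting the collected articles."""
    lexicon: set[str] = set()
    for article in articles:
        lexicon.update(segment(article))
    return lexicon


def build_sample_data(
    screening_lexicon: set[str], sensitive_lexicon: set[str]
) -> list[tuple[str, int]]:
    """Label screening words 0 and sensitive words 1.

    Assumption: a word found in both lexicons is treated as sensitive.
    """
    samples = [(w, 0) for w in sorted(screening_lexicon - sensitive_lexicon)]
    samples += [(w, 1) for w in sorted(sensitive_lexicon)]
    return samples


# Toy data; whitespace splitting stands in for the real segmenter.
articles = ["branch study session", "volunteer service day"]
sensitive = {"badword", "session"}
screening = build_screening_lexicon(articles, str.split)
sample_data = build_sample_data(screening, sensitive)
```

Sorting the words only makes the sample order reproducible; it carries no meaning for training.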
In implementing step S103, the vocabulary feature set is input into the preset classification model for word classification, and the word category of each word is determined, where the word category indicates whether the word is a sensitive word.
That is, the vocabulary feature set corresponding to the text to be approved is input into the preset classification model to determine whether each word in the text to be approved is a sensitive word.
The number of sensitive words in the text to be approved is then determined based on the word category of each word. If the number of sensitive words is greater than or equal to the threshold, it is determined that the text to be approved contains sensitive words. If the number of sensitive words is less than the threshold, it is determined that the text to be approved does not contain sensitive words.
It should be noted that the threshold may be set according to the actual situation. The threshold may be set to 1, in which case the text to be approved is determined to contain sensitive words as soon as one sensitive word is present. The threshold may also be set to N, where N is an integer greater than 1: if the text to be approved has N or more sensitive words, it is determined to contain sensitive words, and if it has fewer than N sensitive words, it is determined not to contain sensitive words.
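The threshold comparison described above can be written in a few lines. The sketch below assumes the classification model has already produced one boolean category per word (True for a sensitive word); the function name `contains_sensitive_words` is hypothetical.

```python
def contains_sensitive_words(word_categories: list[bool], threshold: int = 1) -> bool:
    """Return True when the count of sensitive words reaches the threshold."""
    if threshold < 1:
        raise ValueError("threshold must be a positive integer")
    sensitive_count = sum(word_categories)  # True counts as 1, False as 0
    return sensitive_count >= threshold


# With the default threshold of 1, a single sensitive word is enough.
one_hit = contains_sensitive_words([False, True, False])
# With a threshold of 2, the same text is treated as free of sensitive words.
below_n = contains_sensitive_words([False, True, False], threshold=2)
```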
Step S104: If the text to be approved contains sensitive words, output text information indicating that the text to be approved has failed approval.
In implementing step S104, if it is determined that the text to be approved contains sensitive words, text information indicating that the text has failed approval is output, for example text information stating that the text to be approved contains sensitive words and has failed approval.
Preferably, when the classification model classifies the words in the text to be approved, the sensitive words in the text are marked to obtain marking information for each sensitive word, where the marking information indicates which words in the text to be approved are sensitive words.
When the text information indicating that the text to be approved has failed approval is fed back to the user, the marking information of each sensitive word in the text is also fed back, so that the user can subsequently modify the text.
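One possible shape for the marking information is a list of word positions paired with the offending words, as in the sketch below. The `SensitiveMark` record and the `mark_sensitive_words` function are hypothetical names; the text does not specify a format for the marking information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensitiveMark:
    """Marks one sensitive word by its position in the segmented text."""
    index: int
    word: str


def mark_sensitive_words(words: list[str], categories: list[bool]) -> list[SensitiveMark]:
    """Pair each word with its category and keep only the sensitive ones."""
    if len(words) != len(categories):
        raise ValueError("words and categories must have the same length")
    return [
        SensitiveMark(index, word)
        for index, (word, is_sensitive) in enumerate(zip(words, categories))
        if is_sensitive
    ]


marks = mark_sensitive_words(["good", "badword", "fine"], [False, True, False])
```

Returning positions as well as words lets a user interface highlight each occurrence when the text is sent back for modification.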
Step S105: If the text to be approved does not contain sensitive words, output text information indicating that the text to be approved has passed approval.
In implementing step S105, if it is determined that the text to be approved does not contain sensitive words, text information indicating that the text has passed approval is output, for example the text information "approved".
As described above, whether the text to be approved contains sensitive words is determined by comparing the number of sensitive words with the threshold, and the threshold may be set to an integer greater than 1. That is, even if the text to be approved passes approval, it may still contain a very small number of sensitive words.
Preferably, when the classification model classifies the words in the text to be approved, the sensitive words in the text are marked to obtain marking information for each sensitive word.
When the text information indicating that the text to be approved has passed approval is fed back to the user, the marking information of each sensitive word in the text is also fed back, and the user is prompted that the text contains a small number of sensitive words.
Accordingly, the user may also be asked whether to further modify the text to be approved.
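The three feedback cases of steps S104 and S105 (failed, passed cleanly, and passed with a few residual sensitive words) could be assembled as sketched below. The `ApprovalResult` structure, the `build_result` function, and the message wording are all hypothetical illustrations rather than anything the text prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalResult:
    approved: bool
    message: str
    marked_words: list[str] = field(default_factory=list)
    ask_to_revise: bool = False  # whether to ask the user about further edits


def build_result(marked_words: list[str], threshold: int) -> ApprovalResult:
    """Build the feedback for a text given its marked sensitive words."""
    if len(marked_words) >= threshold:
        return ApprovalResult(
            False, "Approval failed: the text contains sensitive words.", marked_words
        )
    if marked_words:
        # Passed, but a few sensitive words remain below the threshold.
        return ApprovalResult(
            True,
            "Approved, but the text contains a small number of sensitive words.",
            marked_words,
            ask_to_revise=True,
        )
    return ApprovalResult(True, "Approved.")


passed_with_warning = build_result(["badword"], threshold=3)
```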
In this embodiment of the invention, a screening sample vocabulary set and a sensitive sample vocabulary set are constructed from a large number of group-building articles and sensitive words collected in advance. A neural network model is trained on the screening sample vocabulary set and the sensitive sample vocabulary set to obtain a classification model. The classification model classifies the words of the text to be approved and determines whether the text contains sensitive words. If it contains sensitive words, text information indicating that the text to be approved has failed approval is output; if it does not, text information indicating that the text to be approved has passed approval is output. No manual approval is needed, which saves labor and approval cost, increases approval speed, and improves approval efficiency.
The process of obtaining the vocabulary feature set in step S102 of Fig. 1 is shown in Fig. 2, which is a flowchart of obtaining a vocabulary feature set according to an embodiment of the present invention and includes the following steps:
Step S201: Perform word vector conversion on each second word to obtain the word vector corresponding to each second word.
As described for step S101 of Fig. 1, a vocabulary set containing all of the second words is obtained after word segmentation is performed on the text to be approved. In implementing step S201, each second word in the vocabulary set is converted into a corresponding word vector using a preset word vector model, for example Word2vec.
Step S202: Perform dimensionality reduction on the word vector corresponding to each second word to obtain the features of each second word.
In implementing step S202, dimensionality reduction is performed on the word vector corresponding to each second word to obtain the features of that second word. For example, a document topic generation model (Latent Dirichlet Allocation, LDA) is used to reduce the dimensionality of the word vector corresponding to each second word and obtain the features of each second word.
Step S203: Determine the vocabulary feature set based on the features of each second word.
In implementing step S203, the features corresponding to each second word are combined, and a vocabulary feature set containing the features of each second word is determined.
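Steps S201 to S203 can be pictured with the toy sketch below. It is not the Word2vec and LDA combination named above: a small hand-written lookup table stands in for a trained word vector model, and a fixed linear projection stands in for the dimensionality reduction step. All names and numbers (`WORD_VECTORS`, `PROJECTION`, and the functions) are hypothetical and chosen only so the example runs without third-party libraries.

```python
Vector = list[float]

# Hypothetical 4-dimensional word vectors; a trained model would supply these.
WORD_VECTORS: dict[str, Vector] = {
    "team": [1.0, 0.0, 2.0, 0.0],
    "meeting": [0.0, 1.0, 0.0, 2.0],
}
UNKNOWN: Vector = [0.0, 0.0, 0.0, 0.0]

# Hypothetical 2x4 projection matrix standing in for dimensionality reduction.
PROJECTION: list[Vector] = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]


def to_word_vector(word: str) -> Vector:
    """Step S201: look up the word vector, falling back to a zero vector."""
    return WORD_VECTORS.get(word, UNKNOWN)


def reduce_dimensions(vector: Vector) -> Vector:
    """Step S202: project the 4-dimensional vector down to 2 dimensions."""
    return [sum(p * v for p, v in zip(row, vector)) for row in PROJECTION]


def extract_feature_set(words: list[str]) -> list[Vector]:
    """Step S203: gather the reduced features of every word into one set."""
    return [reduce_dimensions(to_word_vector(word)) for word in words]


features = extract_feature_set(["team", "meeting", "unknown"])
```

The same three functions could be reused unchanged for the sample words during training, which matches the way the training steps refer back to Fig. 2.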
In this embodiment of the invention, each second word in the vocabulary set is converted into a word vector using a word vector model, and dimensionality reduction is performed on each resulting word vector to obtain the features of each second word. The vocabulary feature set containing the features of each second word is input into the classification model for word classification to determine whether the text to be approved contains sensitive words, and therefore whether the text passes approval. No manual approval is needed, which saves labor and approval cost, increases approval speed, and improves approval efficiency.
The process of obtaining the classification model in step S103 of Fig. 1 is shown in Fig. 3, which is a flowchart of training a classification model according to an embodiment of the present invention and includes the following steps:
Step S301: Extract the features of each screening sample word in the screening sample vocabulary set to obtain a forward feature set.
In implementing step S301, each screening sample word is converted into a corresponding word vector using the word vector model, and dimensionality reduction is performed on the word vector corresponding to each screening sample word to obtain its features. The features of the screening sample words are combined to obtain a forward feature set containing the features of all screening sample words.
For the specific process of extracting the features of the screening sample words, refer to the content shown in Fig. 2 of the above embodiment, which is not repeated here.
Step S302: Extract the features of each sensitive sample word in the sensitive sample vocabulary set to obtain a reverse feature set.
In implementing step S302, each sensitive sample word is converted into a corresponding word vector using the word vector model, and dimensionality reduction is performed on the word vector corresponding to each sensitive sample word to obtain its features. The features of the sensitive sample words are combined to obtain a reverse feature set containing the features of all sensitive sample words.
For the specific process of extracting the features of the sensitive sample words, refer to the content shown in Fig. 2 of the above embodiment, which is not repeated here.
It should be noted that the execution order of step S301 and step S302 is not limited to executing step S301 before step S302. Step S301 and step S302 may also be executed simultaneously, or step S302 may be executed before step S301; this is not limited here.
Step S303: Input the forward feature set and the reverse feature set into a preset neural network model, and train the neural network model until it converges to obtain the classification model.
In implementing step S303, the forward feature set and the reverse feature set are input into a preset neural network model, and the neural network model is trained using a Support Vector Machine (SVM) algorithm until the neural network model converges, yielding the classification model.
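The text does not detail how the SVM algorithm is applied to the neural network model. One possible reading, offered here only as an assumption, is a single linear unit trained by subgradient descent on the regularized hinge loss, which is the linear SVM objective. In the sketch below the function names, the hyperparameters, the toy feature values, and the labeling convention (forward features as -1, reverse features as +1) are all hypothetical.

```python
Vector = list[float]


def train_classifier(
    forward: list[Vector],
    reverse: list[Vector],
    lr: float = 0.1,
    reg: float = 0.01,
    max_epochs: int = 1000,
    tol: float = 1e-6,
) -> tuple[Vector, float]:
    """Train a linear unit with hinge loss until the loss stops changing."""
    # Assumption: forward (screening) features are labeled -1, reverse +1.
    samples = [(x, -1.0) for x in forward] + [(x, 1.0) for x in reverse]
    if not samples:
        raise ValueError("at least one training sample is required")
    n = len(samples)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    previous_loss = float("inf")
    for _ in range(max_epochs):
        grad_w = [reg * w for w in weights]  # gradient of the L2 penalty
        grad_b = 0.0
        loss = 0.5 * reg * sum(w * w for w in weights)
        for x, y in samples:
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            if margin < 1.0:  # sample violates the margin
                loss += (1.0 - margin) / n
                for i, xi in enumerate(x):
                    grad_w[i] -= y * xi / n
                grad_b -= y / n
        if abs(previous_loss - loss) < tol:  # treated as convergence
            break
        previous_loss = loss
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def is_sensitive(features: Vector, weights: Vector, bias: float) -> bool:
    """Classify a feature vector; a positive score means sensitive."""
    return sum(w * x for w, x in zip(weights, features)) + bias > 0.0


forward_set = [[1.5, 0.0], [1.0, 0.2]]  # toy screening-word features
reverse_set = [[0.0, 1.5], [0.2, 1.0]]  # toy sensitive-word features
weights, bias = train_classifier(forward_set, reverse_set)
```

The loop stops either when the loss change falls below `tol` or after `max_epochs`, so "convergence" here is a practical stopping rule rather than a guarantee.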
In the embodiment of the present invention, the features of each screening sample vocabulary in the screening sample vocabulary set are extracted to obtain the forward feature set, and the features of each sensitive sample vocabulary in the sensitive sample vocabulary set are extracted to obtain the reverse feature set. The neural network model is trained with the forward feature set and the reverse feature set until convergence to obtain a classification model. The trained classification model determines whether the text to be examined and approved contains sensitive words, and thus whether the text passes examination and approval. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
To better explain the content shown in fig. 1 to fig. 3 of the above embodiments of the present invention, fig. 4 is used as an example; it should be noted that fig. 4 is for illustration only.
Referring to fig. 4, another flowchart of a text information processing method according to an embodiment of the present invention is shown, including the following steps:
Step S401: constructing a screening sample word bank.
In the specific implementation of step S401, articles related to intra-cluster construction in the rejoining one-size chapter library are collected, and word segmentation processing is performed on each article, where the word segmentation processing includes Jieba word segmentation, part-of-speech tagging and weight setting.
The screening sample word bank is then constructed from the screening word sets obtained from all the articles after word segmentation processing.
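The word bank construction can be illustrated with a small sketch. The method itself uses Jieba for segmentation; to keep the example free of third-party dependencies, a greedy forward maximum matching segmenter is swapped in here as a simple stand-in, and the dictionary and articles are invented sample data.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (a simple stand-in for Jieba).

    At each position, take the longest dictionary entry of at most
    max_len characters; fall back to a single character otherwise.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def build_screening_word_bank(articles, dictionary):
    """Segment every article and pool the distinct words into one set."""
    word_bank = set()
    for article in articles:
        word_bank.update(forward_max_match(article, dictionary))
    return word_bank


dictionary = {"文本", "审批", "敏感词", "处理"}
articles = ["文本审批", "敏感词处理"]
word_bank = build_screening_word_bank(articles, dictionary)
```

Part-of-speech tagging and weight setting, which step S401 also lists, are omitted from this sketch.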
Step S402: extracting the features of each screening sample vocabulary in the screening sample word bank to obtain a forward feature set, and extracting the features of each sensitive sample vocabulary in the sensitive sample word bank to obtain a reverse feature set.
In the specific implementation of step S402, each screening sample vocabulary and each sensitive sample vocabulary is converted into a word vector by using a Word2vec word vector model. Dimension reduction processing is performed with LDA (Latent Dirichlet Allocation) on the word vector corresponding to each screening sample vocabulary to obtain the forward feature set, and on the word vector corresponding to each sensitive sample vocabulary to obtain the reverse feature set.
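The two-stage shape of this step (word to vector, then vector to lower-dimensional feature) can be sketched as follows. This is not the Word2vec plus LDA pipeline itself: a hand-written embedding table stands in for the word vector model, and a fixed, seeded random projection stands in for the dimension reduction. All names and values are illustrative assumptions.

```python
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Fixed Gaussian projection matrix (stand-in for the reduction step)."""
    rng = random.Random(seed)
    scale = 1.0 / out_dim ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def reduce_vector(vector, projection):
    """Project one word vector down to len(projection) dimensions."""
    return [sum(p * v for p, v in zip(row, vector)) for row in projection]


def extract_feature_set(words, embeddings, projection):
    """Look up each word's vector, reduce it, and collect the features.

    Words missing from the embedding table are skipped.
    """
    return {w: reduce_vector(embeddings[w], projection)
            for w in words if w in embeddings}


# Toy 4-dimensional embeddings (invented values).
embeddings = {"report": [0.1, 0.3, 0.5, 0.7], "secret": [0.9, 0.2, 0.4, 0.1]}
projection = random_projection_matrix(in_dim=4, out_dim=2)
forward_features = extract_feature_set(["report"], embeddings, projection)
reverse_features = extract_feature_set(["secret"], embeddings, projection)
```

The same projection is applied to both sample sets so that the forward and reverse features share one feature space.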
Step S403: training the neural network model with the forward feature set and the reverse feature set in combination with an SVM algorithm until convergence to obtain a classification model.
In the specific implementation process of step S403, the forward feature set and the reverse feature set are input into a preset neural network model, and the neural network model is trained by using an SVM algorithm until convergence, so as to obtain a classification model.
Step S404: performing word segmentation processing on the text to be examined and approved to obtain a vocabulary set.
In the specific implementation of step S404, the word segmentation processing follows the content of step S401.
Step S405: extracting the features of each vocabulary in the vocabulary set to obtain a vocabulary feature set, and inputting the vocabulary feature set into a preset classification model for vocabulary classification.
In the specific implementation of step S405, the feature extraction follows the content of step S402.
Step S406: determining, according to the classification result of the classification model, whether the text to be examined and approved contains sensitive words.
The execution principle of the steps S401 to S406 can refer to the content of each step in fig. 1 to fig. 3 in the embodiment of the present invention, and will not be described herein again.
In the embodiment of the present invention, a screening sample word bank and a sensitive sample word bank are constructed from a large number of pre-collected articles related to intra-cluster construction and from pre-collected sensitive words, respectively. A neural network model is trained with the screening sample word bank and the sensitive sample word bank to obtain a classification model. The classification model is used to classify the vocabularies of the text to be examined and approved and to determine whether the text contains sensitive words, and thus whether it passes examination and approval. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
Corresponding to the text information processing method provided in the foregoing embodiment of the present invention, and referring to fig. 5, an embodiment of the present invention further provides a structural block diagram of a text information processing system. The system includes: a word segmentation unit 501, an extraction unit 502, a classification unit 503 and an output unit 504.
The word segmentation unit 501 is configured to perform word segmentation processing on the text to be examined and approved to obtain a vocabulary set including a plurality of vocabularies.
The extracting unit 502 is configured to extract features of each vocabulary in the vocabulary set, and obtain a vocabulary feature set.
The classification unit 503 is configured to input the vocabulary feature set into a preset classification model to perform vocabulary classification, and determine whether the text to be approved contains sensitive words, where the classification model is obtained by training a neural network model based on sample data in advance, and the sample data includes a screening sample vocabulary set and a sensitive sample vocabulary set.
The output unit 504 is configured to output text information used for indicating that the text to be examined and approved is not approved if the text to be examined and approved contains sensitive words, and output text information used for indicating that the text to be examined and approved passes the approval if the text to be examined and approved does not contain sensitive words.
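One way to picture how the four units cooperate is as a small pipeline object. The sketch below is illustrative only: the class name, the callables passed in and the output messages are all assumptions, and the toy segmenter, extractor and classifier exist solely to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ApprovalSystem:
    segment: Callable[[str], List[str]]                          # unit 501
    extract: Callable[[List[str]], List[Sequence[float]]]        # unit 502
    contains_sensitive: Callable[[List[Sequence[float]]], bool]  # unit 503

    def process(self, text: str) -> str:
        """Run units 501 to 503, then act as output unit 504."""
        words = self.segment(text)
        features = self.extract(words)
        if self.contains_sensitive(features):
            return "Approval failed: the text contains sensitive words."
        return "Approval passed."


# Toy stand-ins: whitespace segmentation and a one-dimensional feature
# that is 1.0 only for the word "secret".
system = ApprovalSystem(
    segment=str.split,
    extract=lambda words: [[1.0 if w == "secret" else 0.0] for w in words],
    contains_sensitive=lambda feats: any(f[0] > 0.5 for f in feats),
)
rejected = system.process("quarterly secret plan")
accepted = system.process("quarterly plan")
```

Passing the units in as callables keeps each one replaceable, which mirrors the statement that the units may or may not be physically separate.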
In the embodiment of the present invention, a screening sample vocabulary set and a sensitive sample vocabulary set are constructed from a large number of pre-collected articles related to intra-cluster construction and from pre-collected sensitive words, respectively. A neural network model is trained with the screening sample vocabulary set and the sensitive sample vocabulary set to obtain a classification model. The classification model is used to classify the vocabularies of the text to be examined and approved and to determine whether the text contains sensitive words. If it does, text information indicating that the text fails examination and approval is output; if it does not, text information indicating that the text passes examination and approval is output. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
Preferably, in conjunction with the content shown in fig. 5, the classification unit 503 includes a first extraction module, a second extraction module and a training module, and the execution principle of each module is as follows:
The first extraction module is configured to extract the features of each screening sample vocabulary in the screening sample vocabulary set to obtain a forward feature set.
The second extraction module is configured to extract the features of each sensitive sample vocabulary in the sensitive sample vocabulary set to obtain a reverse feature set.
The training module is configured to input the forward feature set and the reverse feature set into a preset neural network model, and to train the neural network model until it converges to obtain a classification model.
In the embodiment of the present invention, the features of each screening sample vocabulary in the screening sample vocabulary set are extracted to obtain the forward feature set, and the features of each sensitive sample vocabulary in the sensitive sample vocabulary set are extracted to obtain the reverse feature set. The neural network model is trained with the forward feature set and the reverse feature set until convergence to obtain a classification model. The trained classification model determines whether the text to be examined and approved contains sensitive words, and thus whether the text passes examination and approval. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
Preferably, in conjunction with the content shown in fig. 5, the classification unit 503 includes a classification module, a first determining module and a second determining module, and the execution principle of each module is as follows:
The classification module is configured to input the vocabulary feature set into a preset classification model for vocabulary classification and to determine the vocabulary category of each vocabulary, where the vocabulary category indicates whether the vocabulary is a sensitive word.
The first determining module is configured to determine the number of sensitive words in the text to be examined and approved based on the vocabulary category of each vocabulary.
The second determining module is configured to determine that the text to be examined and approved contains sensitive words if the number of sensitive words is greater than or equal to a threshold value, and to determine that the text does not contain sensitive words if the number is less than the threshold value.
In the embodiment of the present invention, the vocabulary category of each vocabulary in the text to be examined and approved is determined by using the classification model, and the number of sensitive words in the text is determined from these categories. The number of sensitive words is compared with a threshold value to determine whether the text contains sensitive words, and thus whether it passes examination and approval. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
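The counting and threshold comparison performed by the two determining modules can be written in a few lines. In this sketch the per-word categories are assumed to arrive as booleans (True for a sensitive word), and the default threshold of 1 is an assumption, since the text does not state a value.

```python
def count_sensitive(categories):
    """Count the words whose vocabulary category marks them as sensitive."""
    return sum(1 for is_sensitive in categories if is_sensitive)


def contains_sensitive_words(categories, threshold=1):
    """Apply the rule: count >= threshold means the text contains sensitive words."""
    return count_sensitive(categories) >= threshold


flagged = contains_sensitive_words([False, True, False])                 # 1 >= 1
tolerated = contains_sensitive_words([False, True, False], threshold=2)  # 1 < 2
```

Raising the threshold above 1 lets a text with isolated hits still pass, which trades recall for fewer false rejections.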
Preferably, in conjunction with the content shown in fig. 5, the word segmentation unit 501 includes a word segmentation module, a setting module and a determining module, and the execution principle of each module is as follows:
The word segmentation module is configured to perform word segmentation on the text to be examined and approved to obtain a plurality of first vocabularies.
The setting module is configured to perform, for each first vocabulary, part-of-speech tagging and weight setting on the first vocabulary to obtain a second vocabulary.
The determining module is configured to determine the vocabulary set containing all the second vocabularies.
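The step from first vocabularies to second vocabularies can be sketched as attaching two annotations to each segmented word. The text does not say how tags or weights are computed, so this example assumes a lookup table for part-of-speech tags and relative term frequency for weights; `SecondWord`, `annotate` and the tag values are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SecondWord:
    text: str
    pos: str       # part-of-speech tag
    weight: float  # assumed here to be relative term frequency


def annotate(first_words, pos_lexicon, default_pos="x"):
    """Turn segmented first words into tagged, weighted second words.

    Repeated words are merged; their frequency becomes the weight.
    """
    counts = Counter(first_words)
    total = len(first_words)
    return [SecondWord(w, pos_lexicon.get(w, default_pos), counts[w] / total)
            for w in counts]


vocabulary_set = annotate(["bank", "report", "bank", "x1"],
                          {"bank": "n", "report": "n"})
```

Merging repeated words is a design choice of this sketch; keeping one entry per occurrence would work equally well with a per-occurrence weight.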
Preferably, in conjunction with the content shown in fig. 5, the extraction unit 502 includes a conversion module, a dimension reduction module and a determining module, and the execution principle of each module is as follows:
The conversion module is configured to perform word vector conversion on each second vocabulary to obtain a word vector corresponding to each second vocabulary.
The dimension reduction module is configured to perform dimension reduction processing on the word vector corresponding to each second vocabulary to obtain the features of each second vocabulary.
The determining module is configured to determine the vocabulary feature set according to the features of each second vocabulary.
In the embodiment of the present invention, each second vocabulary in the vocabulary set is converted into a word vector by using a word vector model, and dimension reduction processing is performed on each resulting word vector to obtain the features of each second vocabulary. The vocabulary feature set containing the features of each second vocabulary is input into a classification model for vocabulary classification to determine whether the text to be examined and approved contains sensitive words, and thus whether it passes examination and approval. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
In summary, embodiments of the present invention provide a text information processing method and system, where the method includes: performing word segmentation processing on a text to be examined and approved to obtain a vocabulary set including a plurality of vocabularies; extracting the features of each vocabulary in the vocabulary set to obtain a vocabulary feature set; inputting the vocabulary feature set into a preset classification model for vocabulary classification, and determining whether the text to be examined and approved contains sensitive words; if it does, outputting text information indicating that the text fails examination and approval; and if it does not, outputting text information indicating that the text passes examination and approval. In this scheme, a pre-trained classification model classifies the vocabularies of the text to be examined and approved to determine whether it contains sensitive words, and the corresponding approval result is output. No manual examination and approval is needed, which saves labor and approval costs and improves approval speed and efficiency.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for processing text information, the method comprising:
performing word segmentation processing on a text to be examined and approved to obtain a vocabulary set comprising a plurality of vocabularies;
extracting the features of each vocabulary in the vocabulary set to obtain a vocabulary feature set;
inputting the vocabulary feature set into a preset classification model for vocabulary classification, and determining whether the text to be examined and approved contains sensitive words, wherein the classification model is obtained by training a neural network model based on sample data in advance, and the sample data comprises a screening sample vocabulary set and a sensitive sample vocabulary set;
if the text to be examined and approved contains sensitive words, outputting text information for indicating that the text to be examined and approved fails examination and approval;
and if the text to be examined and approved does not contain sensitive words, outputting text information for indicating that the text to be examined and approved passes examination and approval.
2. The method of claim 1, wherein the process of training the classification model comprises:
extracting the features of each screening sample vocabulary in the screening sample vocabulary set to obtain a forward feature set;
extracting the features of each sensitive sample vocabulary in the sensitive sample vocabulary set to obtain a reverse feature set;
and inputting the forward feature set and the reverse feature set into a preset neural network model, and training the neural network model until the neural network model converges to obtain the classification model.
3. The method of claim 1, wherein the inputting the vocabulary feature set into a preset classification model for vocabulary classification and determining whether the text to be examined and approved contains sensitive words comprises:
inputting the vocabulary feature set into the preset classification model for vocabulary classification, and determining the vocabulary category of each vocabulary, wherein the vocabulary category indicates whether the vocabulary is a sensitive word;
determining the number of sensitive words in the text to be examined and approved based on the vocabulary category of each vocabulary;
if the number of the sensitive words is greater than or equal to a threshold value, determining that the text to be examined and approved contains sensitive words;
and if the number of the sensitive words is less than the threshold value, determining that the text to be examined and approved does not contain sensitive words.
4. The method according to claim 1, wherein the performing word segmentation processing on the text to be examined and approved to obtain a vocabulary set comprising a plurality of vocabularies comprises:
performing word segmentation on the text to be examined and approved to obtain a plurality of first vocabularies;
for each first vocabulary, performing part-of-speech tagging and weight setting on the first vocabulary to obtain a second vocabulary;
determining the vocabulary set containing all the second vocabularies.
5. The method of claim 4, wherein the extracting the features of each vocabulary in the vocabulary set to obtain a vocabulary feature set comprises:
performing word vector conversion on each second vocabulary to obtain a word vector corresponding to each second vocabulary;
performing dimensionality reduction on the word vector corresponding to each second vocabulary to obtain the characteristics of each second vocabulary;
and determining a vocabulary feature set according to the feature of each second vocabulary.
6. A system for processing textual information, the system comprising:
the word segmentation unit is used for performing word segmentation processing on a text to be examined and approved to obtain a vocabulary set comprising a plurality of vocabularies;
the extraction unit is used for extracting the features of each vocabulary in the vocabulary set to obtain a vocabulary feature set;
the classification unit is used for inputting the vocabulary feature set into a preset classification model for vocabulary classification and determining whether the text to be examined and approved contains sensitive words, wherein the classification model is obtained by training a neural network model based on sample data in advance, and the sample data comprises a screening sample vocabulary set and a sensitive sample vocabulary set;
and the output unit is used for outputting text information for indicating that the text to be examined and approved fails examination and approval if the text to be examined and approved contains sensitive words, and outputting text information for indicating that the text to be examined and approved passes examination and approval if the text to be examined and approved does not contain sensitive words.
7. The system of claim 6, wherein the classification unit comprises:
the first extraction module is used for extracting the features of each screening sample vocabulary in the screening sample vocabulary set to obtain a forward feature set;
the second extraction module is used for extracting the features of each sensitive sample vocabulary in the sensitive sample vocabulary set to obtain a reverse feature set;
and the training module is used for inputting the forward feature set and the reverse feature set into a preset neural network model, and training the neural network model until the neural network model converges to obtain the classification model.
8. The system of claim 6, wherein the classification unit comprises:
the classification module is used for inputting the vocabulary feature set into the preset classification model for vocabulary classification and determining the vocabulary category of each vocabulary, wherein the vocabulary category indicates whether the vocabulary is a sensitive word;
the first determining module is used for determining the number of sensitive words in the text to be examined and approved based on the vocabulary category of each vocabulary;
and the second determining module is used for determining that the text to be examined and approved contains sensitive words if the number of the sensitive words is greater than or equal to a threshold value, and determining that the text to be examined and approved does not contain sensitive words if the number of the sensitive words is less than the threshold value.
9. The system of claim 6, wherein the word segmentation unit comprises:
the word segmentation module is used for performing word segmentation on the text to be examined and approved to obtain a plurality of first vocabularies;
the setting module is used for performing, for each first vocabulary, part-of-speech tagging and weight setting on the first vocabulary to obtain a second vocabulary;
and the determining module is used for determining the vocabulary set containing all the second vocabularies.
10. The system of claim 9, wherein the extraction unit comprises:
the conversion module is used for performing word vector conversion on each second vocabulary to obtain a word vector corresponding to each second vocabulary;
the dimension reduction module is used for carrying out dimension reduction processing on the word vector corresponding to each second vocabulary to obtain the characteristics of each second vocabulary;
and the determining module is used for determining the vocabulary feature set according to the features of each second vocabulary.
CN201911345064.4A 2019-12-24 2019-12-24 Text information processing method and system Pending CN111309855A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911345064.4A CN111309855A (en) 2019-12-24 2019-12-24 Text information processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911345064.4A CN111309855A (en) 2019-12-24 2019-12-24 Text information processing method and system

Publications (1)

Publication Number Publication Date
CN111309855A true CN111309855A (en) 2020-06-19

Family

ID=71158050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911345064.4A Pending CN111309855A (en) 2019-12-24 2019-12-24 Text information processing method and system

Country Status (1)

Country Link
CN (1) CN111309855A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445998A (en) * 2016-05-26 2017-02-22 达而观信息科技(上海)有限公司 Text content auditing method and system based on sensitive word
CN108647309A (en) * 2018-05-09 2018-10-12 达而观信息科技(上海)有限公司 Chat content checking method based on sensitive word and system
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content
WO2019200806A1 (en) * 2018-04-20 2019-10-24 平安科技(深圳)有限公司 Device for generating text classification model, method, and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597770A (en) * 2020-12-16 2021-04-02 盐城数智科技有限公司 Sensitive information query method based on deep learning
CN112597770B (en) * 2020-12-16 2024-06-11 盐城数智科技有限公司 Sensitive information query method based on deep learning
CN113435843A (en) * 2021-06-28 2021-09-24 平安信托有限责任公司 Batch file generation method and device, electronic equipment and storage medium
CN113779250A (en) * 2021-09-08 2021-12-10 上海松欣智能科技有限公司 Standardized text data processing system
CN115941795A (en) * 2022-03-15 2023-04-07 中移***集成有限公司 Data transmission method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110209764B (en) Corpus annotation set generation method and device, electronic equipment and storage medium
CN111309855A (en) Text information processing method and system
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN107590172B (en) Core content mining method and device for large-scale voice data
CN111310476B (en) Public opinion monitoring method and system using aspect-based emotion analysis method
CN107291840B (en) User attribute prediction model construction method and device
CN108416032B (en) Text classification method, device and storage medium
CN112183099A (en) Named entity identification method and system based on semi-supervised small sample extension
CN110910283A (en) Method, device, equipment and storage medium for generating legal document
CN109948160B (en) Short text classification method and device
CN103336766A (en) Short text garbage identification and modeling method and device
CN106897290B (en) Method and device for establishing keyword model
CN108052505A (en) Text emotion analysis method and device, storage medium, terminal
CN102279890A (en) Sentiment word extracting and collecting method based on micro blog
CN103593431A (en) Internet public opinion analyzing method and device
CN110674298B (en) Deep learning mixed topic model construction method
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN104573030A (en) Textual emotion prediction method and device
CN103246655A (en) Text categorizing method, device and system
CN112036705A (en) Quality inspection result data acquisition method, device and equipment
CN108228808A (en) Determine the method, apparatus of focus incident and storage medium and electronic equipment
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
CN104899310B (en) Information sorting method, the method and device for generating information sorting model
CN109918503A (en) The slot fill method of semantic feature is extracted from attention mechanism based on dynamic window
CN111241843A (en) Semantic relation inference system and method based on composite neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination