CN111209373A - Sensitive text recognition method and device based on natural semantics - Google Patents
Sensitive text recognition method and device based on natural semantics
- Publication number
- CN111209373A
- Authority
- CN
- China
- Prior art keywords
- document
- word
- corpus
- vector library
- detected
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A sensitive text recognition method and device based on natural semantics are disclosed. The method comprises the following steps: acquiring a massive corpus word vector library; performing word segmentation on a sample document; performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library; performing vectorization analysis on the sample document and extracting its fingerprint characteristics; performing word segmentation, word-by-word vectorization and document vectorization analysis, in sequence, on a document to be detected to obtain its fingerprint characteristics; and comparing the fingerprint characteristics of the document to be detected with those of the sample document to identify sensitive documents. The method and device offer a low missed-report rate, resistance to evasion, and high identification efficiency.
Description
Technical Field
The invention relates to the field of computer information processing, in particular to a sensitive text recognition technology based on natural semantic features.
Background
Traditional keyword-based methods for identifying sensitive files have the advantage that policies are simple and intuitive to configure, but they suffer from high false-positive and missed-report rates and are easy to evade. Methods based on file hashes are fast, but small perturbations, such as reordering paragraphs or rewording a passage, trigger the avalanche effect and cause missed reports; such methods cannot judge the similarity between documents by their semantic content.
In addition, traditional methods are often constrained by throughput and performance bottlenecks and cannot balance fast identification, high accuracy, a low false-positive rate and a low missed-report rate.
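The avalanche effect mentioned above for hash-based identification can be illustrated with a short sketch. The two sentences and the choice of SHA-256 are assumptions made for illustration; the source does not name a particular hash function.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_hex_digits(a: str, b: str) -> int:
    """Count the positions at which two equal-length hex strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Hypothetical example texts: the second only rewords the first.
original = "The budget report is confidential."
reworded = "The budget report is secret."

h1 = sha256_hex(original)
h2 = sha256_hex(reworded)
print(differing_hex_digits(h1, h2), "of", len(h1), "hex digits differ")
```

Because the two digests are unrelated even though the sentences mean nearly the same thing, a hash comparison alone cannot recognize the reworded text.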
Disclosure of Invention
The invention provides a sensitive text recognition method and device based on natural semantics that judge the similarity between documents by their semantic content, have low false-positive and missed-report rates, are difficult to evade, and offer high processing efficiency.
The invention provides a sensitive text recognition method based on natural semantics, which comprises the following steps:
acquiring a massive corpus word vector library based on natural corpus;
performing word segmentation on the sample document;
performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
based on the corrected massive corpus word vector library and the corrected small corpus word vector library, performing vectorization analysis on the sample document, and extracting fingerprint characteristics of the sample document;
carrying out word segmentation, word-by-word vectorization and document vectorization analysis on the documents to be detected in sequence to obtain fingerprint characteristics of the documents;
and comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document, and identifying the sensitive document to be detected.
Optionally, acquiring the massive corpus word vector library based on natural corpus includes:
training on the natural corpus with a natural language processing model to obtain the massive corpus word vector library.
Optionally, performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document includes:
taking all words obtained by segmenting the sample document as input, and locating the word vector of each word in the massive corpus word vector library one by one;
correcting the word vector of each word according to its relation to the other words in the sample document;
and, for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which the words appear, storing these word vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
Optionally, the skip-gram model of Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
Optionally, the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which the words appear.
Optionally, the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the corrected small corpus word vector library, so as to extract fingerprint features of the sample document.
Optionally, comparing the fingerprint characteristics of the document to be detected with those of the sample document to identify a sensitive document includes:
calculating the cosine similarity between the fingerprints of the document to be detected and the sample document;
and, if the similarity is higher than a threshold, classifying the current document to be detected as a sensitive document.
Optionally, after comparing the fingerprint characteristics of the document to be detected with those of the sample document and identifying the sensitive document, the method further includes:
outputting the sequence number of the sample document matched by each document classified as sensitive, and completing the alarm.
In another aspect, the present disclosure also provides a sensitive text recognition apparatus, including:
the word segmentation module is used for carrying out Chinese word segmentation on the sample document and the document to be detected;
the word vectorization module is used for carrying out word-by-word vectorization on the sample document based on the massive corpus word vector library;
the document fingerprint calculation module is used for respectively carrying out vectorization analysis on the sample document and the document to be detected and extracting fingerprint characteristics of the sample document and the document to be detected;
and the document fingerprint similarity calculation module is used for comparing the fingerprint characteristics of the document to be detected with the sample document and identifying the sensitive document to be detected.
Optionally, the sensitive text recognition apparatus further includes:
and the natural corpus pre-training module is used for training the natural corpus to obtain a massive corpus word vector library.
The sensitive text recognition method and device based on natural semantics use word vectors to generate a characteristic fingerprint of a document at the level of natural semantics, and then identify texts containing sensitive information by comparing fingerprints. Because the fingerprint carries the semantics and topic information of the document, it is difficult to evade by conventional means, which effectively reduces the false-positive and missed-report rates. At the same time, pre-training on a massive corpus and the use of efficient algorithms give the system good processing efficiency.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 shows a flow diagram of a sensitive text recognition method according to an example embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a flowchart of an exemplary embodiment of a method for natural semantic based sensitive text recognition according to the present disclosure, which includes:
step S101: acquiring a massive corpus word vector library based on natural corpus;
step S102: performing word segmentation on the sample document;
step S103: performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
step S104: based on the corrected massive corpus word vector library and the corrected small corpus word vector library, performing vectorization analysis on the sample document, and extracting fingerprint characteristics of the sample document;
step S105: carrying out word segmentation, word-by-word vectorization and document vectorization analysis on the documents to be detected in sequence to obtain fingerprint characteristics of the documents;
step S106: and comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document, and identifying the sensitive document to be detected.
In the above exemplary embodiment, the sample document is a known sensitive file. After Chinese word segmentation is performed on the document, all of its words are taken as input and vectorized word by word using the corpus of a natural language processing model, and a vector characterizing the document, i.e., the document fingerprint, is then generated. Finally, the fingerprint similarity between the sample document and the document to be detected is compared to judge whether the document to be detected is a sensitive document.
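The source does not say which Chinese word segmentation algorithm is used. As one possible illustration, the sketch below uses forward maximum matching against a dictionary; the function name, the `max_len` parameter and the four-word toy dictionary are all hypothetical.

```python
def forward_max_match(text, vocabulary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no longer match exists.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical toy dictionary; a real segmenter would use a much larger lexicon.
vocabulary = {"敏感", "文本", "识别", "方法"}
tokens = forward_max_match("敏感文本识别方法", vocabulary)
print(tokens)
```

Dictionary-based matching is only one option; statistical segmenters would fit the same step, since the later stages only need a list of words.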
A word vector (word embedding) is a vectorized representation of a word. The present disclosure performs word-by-word vectorization based on natural semantics, so the vector relations between words in the natural corpus can reveal the semantic features within a document.
For example, if "search engine" frequently co-occurs with "***" or "Baidu" in the corpus, then "***" and "Baidu" are mapped to relatively close positions in the vector space when the words are vectorized. Meanwhile, because "***" co-occurs more often with "usa", the vector of "***" lies closer to the vectors of other words that co-occur with "usa" (such as "california" and "apple"), while the vector of "Baidu" lies closer to the vectors of words that co-occur with "china". As a result, the replacement of words in a sensitive document by synonyms or near-synonyms (for example, variants of "secret") can be accurately identified, and even two sentences that express similar meanings can be matched, such as "Wang will attend a meeting on the day of worship" (a colloquial term for Sunday) and "On Sunday, Wang will appear at a meeting".
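The idea that related words sit close together in the vector space can be shown with a toy lookup. The three-dimensional vectors below are invented for illustration and do not come from any trained model; `nearest_words` is a hypothetical helper.

```python
import math

# Hypothetical 3-dimensional vectors; real word vectors have far more dimensions.
toy_vectors = {
    "secret":       [0.90, 0.10, 0.05],
    "confidential": [0.85, 0.15, 0.10],
    "meeting":      [0.10, 0.90, 0.20],
    "conference":   [0.15, 0.85, 0.25],
    "apple":        [0.05, 0.20, 0.95],
}

def nearest_words(word, vectors, k=1):
    """Return the k words whose vectors lie closest (Euclidean distance) to `word`."""
    target = vectors[word]
    others = [(math.dist(target, vec), other)
              for other, vec in vectors.items() if other != word]
    return [other for _, other in sorted(others)[:k]]

print(nearest_words("secret", toy_vectors))
print(nearest_words("meeting", toy_vectors))
```

With vectors like these, replacing a word by its nearest neighbour changes the document-level vector only slightly, which is why synonym substitution does not hide a document from a semantic comparison.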
Directly obtaining a pre-trained massive corpus word vector library as the basis greatly improves the operating efficiency of the system. Further, the massive corpus word vector library is continuously corrected based on the actual word segmentation results of the sample documents, and a small corpus word vector library is established for the irregular words in the sample documents, i.e., new words. Combining the two libraries allows the semantic features of the sample documents to be expressed completely.
Based on the above, the characteristic fingerprint of the document is further generated, and the comparison of the similarity of the fingerprint is carried out, so that the identification of the sensitive document can be carried out.
Optionally, acquiring the massive corpus word vector library based on natural corpus includes:
training on the natural corpus with a natural language processing model to obtain the massive corpus word vector library.
The present disclosure builds a language model for Chinese from the perspective of natural semantics; the training corpus may be entry documents from Chinese wiki covering various industries. The massive corpus word vector library can be obtained through such pre-training, or an existing word vector library based on natural corpus can be obtained directly, for example by importing it.
Optionally, performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document includes:
taking all words obtained by segmenting the sample document as input, and locating the word vector of each word in the massive corpus word vector library one by one;
correcting the word vector of each word according to its relation to the other words in the sample document;
and, for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which the words appear, storing these word vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
Optionally, the skip-gram model of Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
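A minimal pure-Python sketch of skip-gram training with negative sampling is given below. It is not the Word2Vec implementation itself: negatives are drawn uniformly rather than from a smoothed unigram distribution, there is no subsampling, and the function name, hyperparameters and toy corpus are all assumptions chosen for illustration.

```python
import math
import random

def sigmoid(x):
    """Logistic function, clamped to avoid overflow in math.exp."""
    x = max(-30.0, min(30.0, x))
    return 1.0 / (1.0 + math.exp(-x))

def train_sgns(sentences, dim=8, window=2, negatives=3, epochs=20, lr=0.05, seed=0):
    """Train skip-gram vectors with negative sampling on tokenised sentences."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    # Input (word) embeddings start small and random; output (context) embeddings start at zero.
    w_in = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]
    w_out = [[0.0] * dim for _ in vocab]
    for epoch in range(epochs):
        for sent in sentences:
            ids = [index[w] for w in sent]
            for pos, center in enumerate(ids):
                lo, hi = max(0, pos - window), min(len(ids), pos + window + 1)
                for ctx_pos in range(lo, hi):
                    if ctx_pos == pos:
                        continue
                    context = ids[ctx_pos]
                    # One positive pair plus uniformly drawn negatives (assumption: uniform sampling).
                    targets = [(context, 1.0)]
                    for _ in range(negatives):
                        neg = rng.randrange(len(vocab))
                        if neg != context:
                            targets.append((neg, 0.0))
                    grad_center = [0.0] * dim
                    for target, label in targets:
                        score = sum(a * b for a, b in zip(w_in[center], w_out[target]))
                        g = lr * (sigmoid(score) - label)
                        for d in range(dim):
                            grad_center[d] += g * w_out[target][d]
                            w_out[target][d] -= g * w_in[center][d]
                    for d in range(dim):
                        w_in[center][d] -= grad_center[d]
    return {w: w_in[i] for w, i in index.items()}

# Hypothetical toy corpus of already-segmented sentences.
corpus = [["secret", "budget", "report"], ["secret", "budget", "plan"], ["team", "lunch", "menu"]]
vectors = train_sgns(corpus)
print(len(vectors), len(vectors["secret"]))
```

One plausible reading of "correcting" the library is to continue this kind of training on the segmented sample documents starting from the existing vectors; the source does not spell out the procedure.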
Optionally, the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which the words appear.
The approach to OOV (out-of-vocabulary) words is to select a fast, efficient, low-dimensional word vector model. The Nonce2Vec method may be chosen to train vectors for new words quickly and on the fly.
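The lookup-then-fallback flow for new words can be sketched as follows. This is a simplified stand-in, not Nonce2Vec: an unseen word simply receives the average of the known word vectors in its context window. The function name, the `window` parameter and the toy libraries are hypothetical.

```python
def vectorize_tokens(tokens, big_library, small_library, window=2):
    """Look up each token; derive vectors for unseen tokens from their known neighbours."""
    vectors = []
    for pos, token in enumerate(tokens):
        if token in big_library:
            vectors.append(big_library[token])
            continue
        if token not in small_library:
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            context = [big_library[t] for i, t in enumerate(tokens[lo:hi], start=lo)
                       if i != pos and t in big_library]
            if not context:
                vectors.append(None)  # no usable context; the caller decides what to do
                continue
            dim = len(context[0])
            # Simplification: average the context vectors instead of training the new word.
            small_library[token] = [sum(v[d] for v in context) / len(context)
                                    for d in range(dim)]
        vectors.append(small_library[token])
    return vectors

# Hypothetical libraries: "zephyrx" stands for a new word missing from the big library.
big_library = {"project": [1.0, 0.0], "budget": [0.0, 1.0], "report": [1.0, 1.0]}
small_library = {}
result = vectorize_tokens(["project", "zephyrx", "budget"], big_library, small_library)
print(small_library)
```

Keeping the new words in a separate small library, as the source describes, means the pre-trained library can be shared while document-specific vocabulary accumulates on the side.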
Optionally, the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the corrected small corpus word vector library, so as to extract fingerprint features of the sample document.
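As a rough sketch of SIF (smooth inverse frequency) weighting: each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged, and the projection onto the first principal direction of all document vectors is then removed. The code below assumes word probabilities p(w) are supplied by the caller and approximates the principal direction by power iteration; all names and toy values are hypothetical.

```python
import math

def sif_weighted_average(tokens, vectors, word_prob, a=1e-3):
    """Average word vectors with SIF weights a / (a + p(w)); unknown words are skipped."""
    known = [t for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for t in known:
        weight = a / (a + word_prob.get(t, 0.0))
        for d in range(dim):
            total[d] += weight * vectors[t][d]
    return [x / len(known) for x in total] if known else total

def first_principal_direction(rows, iterations=100):
    """Power iteration for the dominant right singular vector of `rows`."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Apply X^T X to u without forming the matrix explicitly.
        proj = [sum(r[d] * u[d] for d in range(dim)) for r in rows]
        nxt = [sum(p * r[d] for p, r in zip(proj, rows)) for d in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u

def remove_common_component(rows):
    """Subtract each row's projection onto the first principal direction."""
    u = first_principal_direction(rows)
    cleaned = []
    for r in rows:
        dot = sum(x * y for x, y in zip(r, u))
        cleaned.append([x - dot * y for x, y in zip(r, u)])
    return cleaned

# Hypothetical toy vectors and unigram probabilities.
vectors = {"budget": [1.0, 0.0], "report": [0.0, 1.0], "the": [1.0, 1.0]}
word_prob = {"budget": 0.001, "report": 0.001, "the": 0.05}
docs = [["the", "budget"], ["the", "report"], ["budget", "report"]]
raw = [sif_weighted_average(d, vectors, word_prob) for d in docs]
fingerprints = remove_common_component(raw)
print(fingerprints)
```

The weighting shrinks the contribution of very frequent words, so the resulting fingerprint is dominated by the less common, more informative vocabulary of the document.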
Optionally, comparing the fingerprint characteristics of the document to be detected with those of the sample document to identify a sensitive document includes:
calculating the cosine similarity between the fingerprints of the document to be detected and the sample document;
and, if the similarity is higher than a threshold, classifying the current document to be detected as a sensitive document.
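The comparison step can be sketched in a few lines. The threshold value of 0.8 is an assumption for illustration only; the source does not give a value.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_sensitive(candidate_fp, sample_fp, threshold=0.8):
    """Classify the candidate as sensitive when similarity exceeds the (assumed) threshold."""
    return cosine_similarity(candidate_fp, sample_fp) > threshold

print(is_sensitive([1.0, 2.0], [2.0, 4.0]))
print(is_sensitive([1.0, 0.0], [0.0, 1.0]))
```

Cosine similarity depends only on the direction of the fingerprints, not their length, so the same threshold can be applied to fingerprints of documents of different lengths.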
Optionally, after comparing the fingerprint characteristics of the document to be detected with those of the sample document and identifying the sensitive document, the method further includes:
step S201: outputting the sequence number of the sample document matched by each document classified as sensitive, and completing the alarm.
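The alarm step can be sketched as a function that takes the similarity of one document against every sample document and reports which samples it matched. The 1-based numbering, the message format and the threshold are assumptions; the source only says that the matching sample sequence number is output.

```python
def matched_sample_numbers(similarities, threshold=0.8):
    """Return 1-based sequence numbers of samples whose similarity exceeds the threshold."""
    return [number for number, score in enumerate(similarities, start=1)
            if score > threshold]

def build_alarm(document_name, similarities, threshold=0.8):
    """Return an alarm message, or None when no sample document is matched."""
    matches = matched_sample_numbers(similarities, threshold)
    if not matches:
        return None
    return f"ALERT: {document_name} matches sample document(s) {matches}"

# Hypothetical similarity scores of one document against three sample documents.
print(build_alarm("outgoing.txt", [0.12, 0.93, 0.85]))
```

Reporting the sample number alongside the alarm lets an operator see which known sensitive file the flagged document resembles.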
The sensitive text recognition apparatus based on natural semantics according to an exemplary embodiment includes:
the word segmentation module is used for carrying out Chinese word segmentation on the sample document and the document to be detected;
the word vectorization module is used for carrying out word-by-word vectorization on the sample document based on the massive corpus word vector library;
the document fingerprint calculation module is used for respectively carrying out vectorization analysis on the sample document and the document to be detected and extracting fingerprint characteristics of the sample document and the document to be detected;
and the document fingerprint similarity calculation module is used for comparing the fingerprint characteristics of the document to be detected with the sample document and identifying the sensitive document to be detected.
Optionally, the sensitive text recognition apparatus further includes:
and the natural corpus pre-training module is used for training the natural corpus to obtain a massive corpus word vector library.
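The module structure of the apparatus might be wired together as below. This is only a structural sketch: each module is injected as a callable, and the stand-ins used in the example (whitespace splitting, a three-word toy library, plain averaging) are hypothetical placeholders for the real segmentation, vectorization and fingerprint modules.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = List[float]

@dataclass
class SensitiveTextRecognizer:
    segment: Callable[[str], List[str]]             # word segmentation module
    vectorize: Callable[[List[str]], List[Vector]]  # word vectorization module
    fingerprint: Callable[[List[Vector]], Vector]   # document fingerprint calculation module
    similarity: Callable[[Vector, Vector], float]   # fingerprint similarity calculation module
    threshold: float = 0.8                          # assumed value, not from the source

    def fingerprint_of(self, text: str) -> Vector:
        """Run segmentation, word vectorization and fingerprinting in sequence."""
        return self.fingerprint(self.vectorize(self.segment(text)))

    def check(self, text: str, sample_fingerprints: Sequence[Vector]) -> List[int]:
        """Return 1-based numbers of the sample fingerprints that the text matches."""
        fp = self.fingerprint_of(text)
        return [i for i, sample in enumerate(sample_fingerprints, start=1)
                if self.similarity(fp, sample) > self.threshold]

# Toy stand-ins; they ignore edge cases such as texts with no known words.
toy = {"secret": [1.0, 0.0], "plan": [0.8, 0.2], "lunch": [0.0, 1.0]}
recognizer = SensitiveTextRecognizer(
    segment=str.split,
    vectorize=lambda tokens: [toy[t] for t in tokens if t in toy],
    fingerprint=lambda vecs: [sum(col) / len(vecs) for col in zip(*vecs)],
    similarity=lambda a, b: sum(x * y for x, y in zip(a, b))
    / (math.hypot(*a) * math.hypot(*b)),
)
samples = [recognizer.fingerprint_of("secret plan")]
print(recognizer.check("secret", samples), recognizer.check("lunch", samples))
```

Injecting the modules as callables keeps the pipeline order fixed while letting each stage be swapped independently, which mirrors the separation of modules in the apparatus.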
According to the sensitive text recognition method and device based on natural semantics of the present disclosure, word vectors are used to generate a characteristic fingerprint of a document at the level of natural semantics, and texts containing sensitive information are then identified by comparing fingerprints. Compared with the prior art, the disclosure has the following advantages:
① Low missed-report rate: because the model works at the semantic level, it can recognize linguistic information including but not limited to synonyms, near-synonyms, grammar and sentence patterns. Even if the wording or paragraph order is modified, whether a document is similar to a sample document can still be determined accurately, which blocks attempts to evade detection by changing order or wording.
② High efficiency and effectiveness: vectors for new words are trained quickly during document vectorization, and the similarity comparison outperforms a number of advanced neural network models (such as several RNN and LSTM models).
③ Cross-domain: the model, trained on a massive Chinese corpus, covers most of the semantic information of various industries.
④ Suitable for both long and short documents: it performs well on long papers as well as on messages of only a few dozen characters.
The foregoing is illustrative of the present invention. Various modifications and changes in form or detail will readily occur to those skilled in the art based on the teachings and principles disclosed herein; these are to be regarded as illustrative rather than restrictive of the broad principles of the present invention.
Claims (10)
1. A sensitive text recognition method based on natural semantics is characterized by comprising the following steps:
acquiring a massive corpus word vector library based on natural corpus;
performing word segmentation on the sample document;
performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
based on the corrected massive corpus word vector library and the corrected small corpus word vector library, performing vectorization analysis on the sample document, and extracting fingerprint characteristics of the sample document;
carrying out word segmentation, word-by-word vectorization and document vectorization analysis on the documents to be detected in sequence to obtain fingerprint characteristics of the documents;
and comparing the fingerprint characteristics of the document to be detected with the fingerprint characteristics of the sample document, and identifying the sensitive document to be detected.
2. The sensitive text recognition method according to claim 1, wherein acquiring the massive corpus word vector library based on natural corpus comprises:
and training the natural corpus by using a natural language processing model to obtain the massive corpus word vector library.
3. The method according to claim 1, wherein said performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document comprises:
taking all vocabularies obtained after word segmentation of the sample document as input, and positioning word vectors of the words in the massive corpus word vector library one by one;
correcting the word vector of the word according to the relation between the word and other words in the sample document;
and analyzing word vectors of the new words which are not in the massive corpus word vector library according to the context of the document where the words are located to obtain word vectors, storing the word vectors into a small corpus word vector library, and continuously updating the small corpus word vector library along with the addition of new words.
4. The sensitive text recognition method according to claim 1 or 2, wherein the skip-gram model of Word2Vec, accelerated with negative sampling, is used to train on a natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
5. The sensitive text recognition method of claim 3, wherein the Nonce2Vec method is used to derive word vectors for the new words that are not in the massive corpus word vector library according to the context of the document in which the words appear.
6. The sensitive text recognition method according to claim 1, wherein the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the corrected small corpus word vector library, to extract fingerprint features of the sample document.
7. The sensitive text identification method according to claim 1, wherein the comparing the fingerprint characteristics of the document to be detected and the sample document to identify the sensitive document to be detected comprises:
calculating the cosine similarity of the fingerprints of the document to be detected and the sample document;
and if the similarity is higher than the threshold value, classifying the current document to be detected as a sensitive document.
8. The method of claim 1, wherein after comparing the fingerprint characteristics of the document to be tested with the fingerprint characteristics of the sample document and identifying the sensitive document to be tested, the method further comprises:
and outputting the sample document sequence number corresponding to the document to be detected classified as the sensitive document, and finishing the alarm.
9. A sensitive text recognition apparatus applying the sensitive text recognition method according to any one of claims 1 to 8, comprising:
the word segmentation module is used for carrying out Chinese word segmentation on the sample document and the document to be detected;
the word vectorization module is used for carrying out word-by-word vectorization on the sample document based on the massive corpus word vector library;
the document fingerprint calculation module is used for respectively carrying out vectorization analysis on the sample document and the document to be detected and extracting fingerprint characteristics of the sample document and the document to be detected;
and the document fingerprint similarity calculation module is used for comparing the fingerprint characteristics of the document to be detected with the sample document and identifying the sensitive document to be detected.
10. The sensitive text recognition apparatus of claim 9, further comprising:
and the natural corpus pre-training module is used for training the natural corpus to obtain a massive corpus word vector library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010012173.0A CN111209373A (en) | 2020-01-07 | 2020-01-07 | Sensitive text recognition method and device based on natural semantics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010012173.0A CN111209373A (en) | 2020-01-07 | 2020-01-07 | Sensitive text recognition method and device based on natural semantics |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111209373A true CN111209373A (en) | 2020-05-29 |
Family
ID=70788651
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010012173.0A Pending CN111209373A (en) | 2020-01-07 | 2020-01-07 | Sensitive text recognition method and device based on natural semantics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111209373A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070143322A1 (en) * | 2005-12-15 | 2007-06-21 | International Business Machines Corporation | Document comparision using multiple similarity measures |
CN106484664A (en) * | 2016-10-21 | 2017-03-08 | 竹间智能科技(上海)有限公司 | Similarity calculating method between a kind of short text |
CN106776534A (en) * | 2016-11-11 | 2017-05-31 | 北京工商大学 | The incremental learning method of term vector model |
CN109101476A (en) * | 2017-06-21 | 2018-12-28 | 阿里巴巴集团控股有限公司 | A kind of term vector generates, data processing method and device |
CN109344407A (en) * | 2018-10-29 | 2019-02-15 | 北京天融信网络安全技术有限公司 | Semantic-based document fingerprint construction method, storage medium and computer equipment |
CN109766525A (en) * | 2019-01-14 | 2019-05-17 | 湖南大学 | A kind of sensitive information leakage detection framework of data-driven |
Non-Patent Citations (1)
Title |
---|
姜雪 (Jiang Xue): "Research on a simhash-based text similarity detection algorithm" (基于simhash的文本相似检测算法研究), pages 6-18 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116305285A (en) * | 2023-03-30 | 2023-06-23 | 肇庆学院 | Patient information desensitization processing method and system combining artificial intelligence |
CN116305285B (en) * | 2023-03-30 | 2024-04-05 | 肇庆学院 | Patient information desensitization processing method and system combining artificial intelligence |
CN117349407A (en) * | 2023-12-04 | 2024-01-05 | 江苏君立华域信息安全技术股份有限公司 | Automatic detection method and system for content security |
CN117349407B (en) * | 2023-12-04 | 2024-01-30 | 江苏君立华域信息安全技术股份有限公司 | Automatic detection method and system for content security |
CN117993018A (en) * | 2024-03-29 | 2024-05-07 | 蚂蚁科技集团股份有限公司 | Access method of third party large language model and gateway server |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109977416B (en) | | Multi-level natural language anti-spam text method and system |
CN110096570B (en) | | Intention identification method and device applied to intelligent customer service robot |
CN108304372B (en) | | Entity extraction method and device, computer equipment and storage medium |
US11544459B2 (en) | | Method and apparatus for determining feature words and server |
CN107193796B (en) | | Public opinion event detection method and device |
KR20110038474A (en) | | Apparatus and method for detecting sentence boundaries |
CN111209373A (en) | | Sensitive text recognition method and device based on natural semantics |
CN113742733B (en) | | Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type |
CN116127953B (en) | | Chinese spelling error correction method, device and medium based on contrast learning |
CN112287100A (en) | | Text recognition method, spelling error correction method and voice recognition method |
CN112468659A (en) | | Quality evaluation method, device, equipment and storage medium applied to telephone customer service |
CN115080750B (en) | | Weak supervision text classification method, system and device based on fusion prompt sequence |
CN114416979A (en) | | Text query method, text query equipment and storage medium |
CN115759071A (en) | | Government affair sensitive information identification system and method based on big data |
CN116150651A (en) | | AI-based depth synthesis detection method and system |
CN113486174B (en) | | Model training, reading understanding method and device, electronic equipment and storage medium |
Birla | | A robust unsupervised pattern discovery and clustering of speech signals |
CN112784601A (en) | | Key information extraction method and device, electronic equipment and storage medium |
CN110377753B (en) | | Relation extraction method and device based on relation trigger word and GRU model |
CN112528653A (en) | | Short text entity identification method and system |
CN115858776B (en) | | Variant text classification recognition method, system, storage medium and electronic equipment |
CN115630357B (en) | | Method for judging behavior of collecting personal information by application program crossing boundary |
CN116305257A (en) | | Privacy information monitoring device and privacy information monitoring method |
CN115983266A (en) | | Pinyin variant text identification method and system for checking credit investigation data of bank |
CN116070642A (en) | | Text emotion analysis method and related device based on expression embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||