CN111209373A - Sensitive text recognition method and device based on natural semantics - Google Patents

Sensitive text recognition method and device based on natural semantics

Info

Publication number
CN111209373A
Authority
CN
China
Prior art keywords
document
word
corpus
vector library
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010012173.0A
Other languages
Chinese (zh)
Inventor
万淼
孙彦芬
王歆怡
陈锦
王禹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Venus Information Security Technology Co Ltd
China Information Technology Security Evaluation Center
Original Assignee
Beijing Venus Information Security Technology Co Ltd
China Information Technology Security Evaluation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Venus Information Security Technology Co Ltd and China Information Technology Security Evaluation Center
Priority to CN202010012173.0A
Publication of CN111209373A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A sensitive text recognition method and device based on natural semantics are disclosed. The method comprises the following steps: acquiring a massive corpus word vector library; performing word segmentation on a sample document; performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library; performing vectorization analysis on the sample document and extracting its fingerprint features; performing word segmentation, word-by-word vectorization, and document vectorization analysis on a document to be detected, in sequence, to obtain its fingerprint features; and comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected. The method and device offer a low false-negative rate, resistance to evasion, and high recognition efficiency.

Description

Sensitive text recognition method and device based on natural semantics
Technical Field
The invention relates to the field of computer information processing, in particular to a sensitive text recognition technology based on natural semantic features.
Background
The traditional keyword-based method for identifying sensitive files has the advantage that policies are simple and intuitive to configure, but it suffers from high false-positive and false-negative rates and is easy to evade. The file-hash-based method has the advantage of high processing speed, but small perturbations, such as reordering paragraphs or rewording the text, trigger the avalanche effect and cause missed detections, so the similarity between documents cannot truly be judged from their semantic content.
In addition, traditional methods are often limited by traffic and performance bottlenecks and cannot balance fast identification, high accuracy, a low false-positive rate, and a low false-negative rate.
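The avalanche effect described for hash-based identification can be illustrated with a minimal Python sketch. SHA-256 is only an assumed example of a file hash (the source does not name a specific hash function), and the two sample sentences are hypothetical.

```python
import hashlib


def file_hash(text: str) -> str:
    """Return the SHA-256 digest of the text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two hypothetical documents that differ only in the final character.
original = "Wang will attend the meeting on Sunday."
edited = "Wang will attend the meeting on Sunday!"

h_original = file_hash(original)
h_edited = file_hash(edited)

# Count the hex positions at which the two digests disagree.
differing = sum(a != b for a, b in zip(h_original, h_edited))

print(h_original)
print(h_edited)
print(f"{differing} of {len(h_original)} hex digits differ")
```

Because the two digests share no usable structure, a hash comparison can only report "identical" or "different", which is why a reworded copy of a sensitive file goes undetected.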
Disclosure of Invention
The invention provides a sensitive text recognition method and device based on natural semantics, which judge the similarity between documents from their semantic content, have low false-positive and false-negative rates, are difficult to evade, and have high processing efficiency.
The invention provides a sensitive text recognition method based on natural semantics, which comprises the following steps:
acquiring a massive corpus word vector library based on a natural corpus;
performing word segmentation on the sample document;
performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
based on the corrected massive corpus word vector library and the small corpus word vector library, performing vectorization analysis on the sample document and extracting fingerprint features of the sample document;
performing word segmentation, word-by-word vectorization, and document vectorization analysis on the document to be detected, in sequence, to obtain its fingerprint features;
and comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected.
Optionally, acquiring the massive corpus word vector library based on a natural corpus includes:
training on the natural corpus with a natural language processing model to obtain the massive corpus word vector library.
Optionally, performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document include:
taking all words obtained by segmenting the sample document as input, and locating the word vector of each word in the massive corpus word vector library one by one;
correcting the word vector of each word according to its relation to the other words in the sample document;
and, for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which they appear, storing those vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
Optionally, the skip-gram model in Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
Optionally, the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which they appear.
Optionally, the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the small corpus word vector library, so as to extract the fingerprint features of the sample document.
Optionally, comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected includes:
calculating the cosine similarity between the fingerprints of the document to be detected and the sample document;
and, if the similarity is higher than a threshold, classifying the current document to be detected as a sensitive document.
Optionally, after comparing the fingerprint features of the document to be detected with those of the sample document and identifying the sensitive document to be detected, the method further includes:
outputting the sequence number of the sample document that corresponds to the document classified as sensitive, and completing the alert.
In another aspect, the present disclosure also provides a sensitive text recognition apparatus, including:
a word segmentation module for performing Chinese word segmentation on the sample document and the document to be detected;
a word vectorization module for performing word-by-word vectorization on the sample document based on the massive corpus word vector library;
a document fingerprint calculation module for performing vectorization analysis on the sample document and the document to be detected, respectively, and extracting their fingerprint features;
and a document fingerprint similarity calculation module for comparing the fingerprint features of the document to be detected with those of the sample document and identifying sensitive documents to be detected.
Optionally, the sensitive text recognition apparatus further includes:
a natural corpus pre-training module for training on the natural corpus to obtain the massive corpus word vector library.
The sensitive text recognition method and device based on natural semantics use word vectors to generate a feature fingerprint of a document at the natural-semantic level, and then identify texts containing sensitive information by comparing fingerprints. Because the fingerprint contains the semantic and topical information of the document, it is difficult to evade by conventional means, which effectively reduces the false-positive and false-negative rates. At the same time, pre-training on a massive corpus and the use of efficient algorithms give the system good processing efficiency.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout the exemplary embodiments of the disclosure.
FIG. 1 shows a flow diagram of a sensitive text recognition method according to an example embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a flowchart of an exemplary embodiment of a sensitive text recognition method based on natural semantics according to the present disclosure, which includes:
step S101: acquiring a massive corpus word vector library based on a natural corpus;
step S102: performing word segmentation on the sample document;
step S103: performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
step S104: based on the corrected massive corpus word vector library and the small corpus word vector library, performing vectorization analysis on the sample document and extracting fingerprint features of the sample document;
step S105: performing word segmentation, word-by-word vectorization, and document vectorization analysis on the document to be detected, in sequence, to obtain its fingerprint features;
step S106: comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected.
In the above exemplary embodiment, the sample document is a known sensitive file. After Chinese word segmentation is performed on the document, all of its words are taken as input and vectorized word by word using the corpus of a natural language processing model; a vector characterizing the document, i.e., the document fingerprint, is then generated. Finally, the fingerprint similarity between the sample document and the document to be detected is compared to judge whether the document to be detected is a sensitive document.
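The word segmentation step can be sketched in Python as follows. The source does not say which segmentation algorithm is used; forward maximum matching against a dictionary is shown here only as a simple, assumed stand-in, and the dictionary contents are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily taking the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary and sentence ("confidential documents must not be circulated").
dictionary = {"机密", "文件", "不得", "外传"}
tokens = forward_max_match("机密文件不得外传", dictionary)
print(tokens)
```

The resulting token list is what the later word-by-word vectorization step takes as input.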
A word vector (word embedding) is a vectorized representation of a word. The present disclosure performs word-by-word vectorization based on natural semantics, so that the vector relations between words in a natural corpus can reveal the semantic features inside a document.
For example, if the word "search engine" always appears in the corpus together with "***" or "Baidu", then "***" and "Baidu" are mapped to relatively close positions in the vector space when the words are vectorized. Meanwhile, "***" co-occurs more often with "usa", so "***" lies closer to other word vectors that co-occur with "usa" (such as "california" and "apple"), whereas "Baidu" lies closer to word vectors that co-occur with "china". As a result, the replacement of a word in a sensitive document by a synonym or near-synonym, such as another word for "secret", can be accurately identified, and even two sentences that express similar meanings, such as "Wang will attend a meeting on the day of worship" and "Sunday, Wang will appear at a meeting", can be recognized as similar.
Directly taking a pre-trained massive corpus word vector library as the basis greatly improves the operating efficiency of the system. Further, the massive corpus word vector library is continuously corrected based on the actual word segmentation results of the sample document, and a small corpus word vector library is established for the irregular words in the sample document, i.e., new words. Combining the two libraries allows the semantic features of the sample document to be fully expressed.
On this basis, the feature fingerprint of the document is generated and fingerprint similarities are compared, so that sensitive documents can be identified.
Optionally, acquiring the massive corpus word vector library based on a natural corpus includes:
training on the natural corpus with a natural language processing model to obtain the massive corpus word vector library.
The present disclosure builds a language model for Chinese from the perspective of natural semantics. The training corpus may consist of entry documents from Chinese wiki covering various industries. The massive corpus word vector library can be obtained through such pre-training, or an existing word vector library based on a natural corpus can be obtained directly, for example by importing it.
Optionally, performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document include:
taking all words obtained by segmenting the sample document as input, and locating the word vector of each word in the massive corpus word vector library one by one;
correcting the word vector of each word according to its relation to the other words in the sample document;
and, for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which they appear, storing those vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
Optionally, the skip-gram model in Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
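As an illustration of what skip-gram training with negative sampling optimizes, the following Python sketch computes the standard negative-sampling loss for a single (center word, context word) pair. It is not a training loop; the vectors and the helper name `sgns_loss` are hypothetical and chosen only for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    loss = -log sigmoid(context . center) - sum_n log sigmoid(-negative_n . center)
    `center` is the input vector of the center word; `context` and `negatives`
    are output vectors of the observed context word and of the sampled noise words.
    """
    loss = -math.log(sigmoid(dot(context, center)))
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss


# Hypothetical 2-dimensional vectors.
center = [1.0, 0.0]
negatives = [[-1.0, 0.5], [0.0, -1.0]]
loss_aligned = sgns_loss(center, [2.0, 0.0], negatives)   # context points the same way
loss_opposed = sgns_loss(center, [-2.0, 0.0], negatives)  # context points the other way
print(loss_aligned, loss_opposed)
```

The loss is lower when the observed context vector aligns with the center vector and the sampled noise vectors do not, which is what pulls co-occurring words toward nearby positions in the vector space. Only a handful of noise words are scored per pair instead of the whole vocabulary, which is the source of the speed-up.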
Optionally, the Nonce2Vec method is used to derive word vectors for new words that are not in the massive corpus word vector library from the context of the document in which they appear.
The approach to handling OOV (out-of-vocabulary) words is to select a fast and efficient low-dimensional word vector model. The Nonce2Vec method can be chosen to train vectors for new words quickly and on the fly.
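The interplay between the massive corpus library and the small corpus library can be sketched as follows. This Python example does not reproduce Nonce2Vec training; as a deliberately simplified stand-in, it derives a new word's vector by averaging the vectors of known neighbouring words. The function name, the window size, and the toy vectors are all assumptions made for illustration.

```python
def vectorize_words(tokens, large_library, small_library, window=2):
    """Look up each token; derive and store vectors for words missing from large_library."""
    result = []
    for i, token in enumerate(tokens):
        if token in large_library:
            result.append(large_library[token])
            continue
        if token not in small_library:
            # Collect vectors of known neighbours within the context window.
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            context = [
                large_library[t]
                for j, t in enumerate(tokens[lo:hi], start=lo)
                if j != i and t in large_library
            ]
            if not context:
                continue  # no usable context, so this word is skipped
            dim = len(context[0])
            # Simplified stand-in for Nonce2Vec: mean of the context vectors.
            small_library[token] = [
                sum(v[k] for v in context) / len(context) for k in range(dim)
            ]
        result.append(small_library[token])
    return result


# Hypothetical libraries; "zephyrx" plays the role of a new word.
large_library = {"project": [1.0, 0.0], "budget": [0.0, 1.0]}
small_library = {}
vectors = vectorize_words(["project", "zephyrx", "budget"], large_library, small_library)
print(small_library)
```

Once stored, the new word's vector is reused on later lookups, so the small library grows as new words are encountered.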
Optionally, the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the small corpus word vector library, so as to extract the fingerprint features of the sample document.
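A minimal Python sketch of SIF (smooth inverse frequency) follows, assuming the commonly described form of the algorithm: each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged, and the projection onto the first principal direction of all document vectors is removed. The value of `a`, the word probabilities, and the toy vectors are hypothetical, and the principal direction is estimated by plain power iteration to keep the example free of external libraries.

```python
import math


def sif_weighted_average(tokens, vectors, word_prob, a=1e-3):
    """Average word vectors with weight a / (a + p(w)); frequent words count less."""
    dim = len(next(iter(vectors.values())))
    acc = [0.0] * dim
    known = [t for t in tokens if t in vectors]
    for t in known:
        weight = a / (a + word_prob.get(t, 0.0))
        for i, x in enumerate(vectors[t]):
            acc[i] += weight * x
    n = max(len(known), 1)
    return [x / n for x in acc]


def first_principal_direction(rows, iterations=100):
    """Estimate the top right singular vector of the row matrix by power iteration."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        proj = [sum(r[i] * u[i] for i in range(dim)) for r in rows]            # X u
        nxt = [sum(p * r[i] for p, r in zip(proj, rows)) for i in range(dim)]  # X^T (X u)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u


def remove_common_component(rows):
    """Subtract from every row its projection onto the first principal direction."""
    u = first_principal_direction(rows)
    out = []
    for r in rows:
        d = sum(x * y for x, y in zip(r, u))
        out.append([x - d * y for x, y in zip(r, u)])
    return out


# Hypothetical word vectors and unigram probabilities.
vectors = {"contract": [1.0, 0.2, 0.0], "secret": [0.9, 0.1, 0.3],
           "weather": [0.1, 1.0, 0.0], "the": [0.5, 0.5, 0.5]}
word_prob = {"the": 0.05, "contract": 0.001, "secret": 0.0005, "weather": 0.002}
docs = [["the", "secret", "contract"], ["the", "weather"], ["secret", "contract"]]

raw = [sif_weighted_average(d, vectors, word_prob) for d in docs]
fingerprints = remove_common_component(raw)
print(fingerprints)
```

In this sketch the down-weighting of frequent words and the removal of the shared direction both reduce the influence of content common to all documents, so the resulting fingerprints emphasize what distinguishes one document from another.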
Optionally, comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected includes:
calculating the cosine similarity between the fingerprints of the document to be detected and the sample document;
and, if the similarity is higher than a threshold, classifying the current document to be detected as a sensitive document.
In addition, optionally, after comparing the fingerprint features of the document to be detected with those of the sample document and identifying the sensitive document to be detected, the method further includes:
step S201: outputting the sequence number of the sample document that corresponds to the document classified as sensitive, and completing the alert.
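The comparison and alert steps can be sketched in Python as follows. The threshold value of 0.8 and the function names are assumptions made for illustration; the source states only that a document is classified as sensitive when the cosine similarity is higher than a threshold.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two fingerprint vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_sensitive(candidate, sample_fingerprints, threshold=0.8):
    """Return (sample_index, similarity) for the best match above the threshold, else None."""
    best_index, best_score = None, threshold
    for index, sample in enumerate(sample_fingerprints):
        score = cosine_similarity(candidate, sample)
        if score > best_score:
            best_index, best_score = index, score
    return None if best_index is None else (best_index, best_score)


# Hypothetical fingerprints of two sample documents.
samples = [[1.0, 0.0], [0.0, 1.0]]
hit = match_sensitive([0.9, 0.1], samples)   # close to sample 0
miss = match_sensitive([1.0, 1.0], samples)  # equally far from both samples
print(hit, miss)
```

The returned index plays the role of the sample document sequence number that is output with the alert.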
The sensitive text recognition apparatus based on natural semantics according to an exemplary embodiment includes:
a word segmentation module for performing Chinese word segmentation on the sample document and the document to be detected;
a word vectorization module for performing word-by-word vectorization on the sample document based on the massive corpus word vector library;
a document fingerprint calculation module for performing vectorization analysis on the sample document and the document to be detected, respectively, and extracting their fingerprint features;
and a document fingerprint similarity calculation module for comparing the fingerprint features of the document to be detected with those of the sample document and identifying sensitive documents to be detected.
Optionally, the sensitive text recognition apparatus further includes:
a natural corpus pre-training module for training on the natural corpus to obtain the massive corpus word vector library.
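One way to picture how the modules fit together is a small Python container that wires four interchangeable callables, one per module. The class name, the threshold, and the toy stand-ins below are hypothetical; whitespace splitting and plain averaging are used only to keep the example short and are not the segmentation or fingerprinting methods of the disclosure.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class SensitiveTextRecognizer:
    segment: Callable[[str], List[str]]             # word segmentation module
    vectorize: Callable[[List[str]], List[Vector]]  # word vectorization module
    fingerprint: Callable[[List[Vector]], Vector]   # document fingerprint calculation module
    similarity: Callable[[Vector, Vector], float]   # fingerprint similarity calculation module
    threshold: float = 0.8                          # assumed value

    def fingerprint_of(self, text: str) -> Vector:
        return self.fingerprint(self.vectorize(self.segment(text)))

    def is_sensitive(self, text: str, sample_text: str) -> bool:
        score = self.similarity(self.fingerprint_of(text), self.fingerprint_of(sample_text))
        return score > self.threshold


# Toy stand-ins for each module.
TOY_VECTORS = {"secret": [1.0, 0.0], "confidential": [0.9, 0.1], "plan": [0.5, 0.5],
               "lunch": [0.0, 1.0], "menu": [0.1, 0.9]}


def toy_vectorize(tokens):
    return [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]


def toy_fingerprint(vectors):
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(len(vectors[0]))]


def toy_cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


recognizer = SensitiveTextRecognizer(str.split, toy_vectorize, toy_fingerprint, toy_cosine)
print(recognizer.is_sensitive("confidential plan", "secret plan"))
print(recognizer.is_sensitive("lunch menu", "secret plan"))
```

Keeping each module behind a plain callable makes it possible to swap in a real segmenter, a trained word vector library, or a different fingerprint algorithm without changing the surrounding flow.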
The sensitive text recognition method and device based on natural semantics use word vectors to generate a feature fingerprint of a document at the natural-semantic level, and then identify texts containing sensitive information by comparing fingerprints. Compared with the prior art, the method has the following advantages:
① Low false-negative rate: the method models documents at the semantic level and can recognize linguistic information including but not limited to synonyms, near-synonyms, grammar, and sentence patterns. Even if the wording or paragraph order is modified, whether a document is similar to a sample document can still be determined accurately, which blocks attempts to evade detection by changing order or wording.
② Excellent efficiency and effectiveness: document vectorization and new-word training are fast, and similarity comparison outperforms a number of advanced neural network models (such as several RNN and LSTM models).
③ Cross-domain coverage: the model, trained on a massive Chinese corpus, covers most of the semantic information of various industry fields.
④ Applicable to both long and short documents: the method performs well on everything from full papers to messages of a few dozen characters.
The foregoing is illustrative of the present invention. Various modifications and changes in form or detail will readily occur to those skilled in the art based on the teachings and principles disclosed herein; the embodiments are to be regarded as illustrative rather than restrictive of the broad principles of the present invention.

Claims (10)

1. A sensitive text recognition method based on natural semantics, characterized by comprising the following steps:
acquiring a massive corpus word vector library based on a natural corpus;
performing word segmentation on the sample document;
performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing a small corpus word vector library based on the new words in the sample document;
based on the corrected massive corpus word vector library and the small corpus word vector library, performing vectorization analysis on the sample document and extracting fingerprint features of the sample document;
performing word segmentation, word-by-word vectorization, and document vectorization analysis on the document to be detected, in sequence, to obtain its fingerprint features;
and comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected.
2. The sensitive text recognition method according to claim 1, wherein acquiring the massive corpus word vector library based on a natural corpus comprises:
training on the natural corpus with a natural language processing model to obtain the massive corpus word vector library.
3. The sensitive text recognition method according to claim 1, wherein performing word-by-word vectorization on the sample document, correcting the massive corpus word vector library, and establishing the small corpus word vector library based on the new words in the sample document comprise:
taking all words obtained by segmenting the sample document as input, and locating the word vector of each word in the massive corpus word vector library one by one;
correcting the word vector of each word according to its relation to the other words in the sample document;
and, for new words that are not in the massive corpus word vector library, deriving word vectors from the context of the document in which they appear, storing those vectors in the small corpus word vector library, and continuously updating the small corpus word vector library as new words are added.
4. The sensitive text recognition method according to claim 1 or 2, wherein the skip-gram model in Word2Vec, accelerated with negative sampling, is used to train on the natural corpus to obtain the massive corpus word vector library, or to correct the massive corpus word vector library.
5. The sensitive text recognition method according to claim 3, wherein the Nonce2Vec method is used to derive word vectors for the new words that are not in the massive corpus word vector library from the context of the document in which they appear.
6. The sensitive text recognition method according to claim 1, wherein the SIF algorithm is used to perform vectorization analysis on the sample document, based on the corrected massive corpus word vector library and the small corpus word vector library, so as to extract the fingerprint features of the sample document.
7. The sensitive text recognition method according to claim 1, wherein comparing the fingerprint features of the document to be detected with those of the sample document to identify sensitive documents to be detected comprises:
calculating the cosine similarity between the fingerprints of the document to be detected and the sample document;
and, if the similarity is higher than a threshold, classifying the current document to be detected as a sensitive document.
8. The sensitive text recognition method according to claim 1, wherein, after comparing the fingerprint features of the document to be detected with those of the sample document and identifying the sensitive document to be detected, the method further comprises:
outputting the sequence number of the sample document that corresponds to the document classified as sensitive, and completing the alert.
9. A sensitive text recognition apparatus applying the sensitive text recognition method according to any one of claims 1 to 8, comprising:
a word segmentation module for performing Chinese word segmentation on the sample document and the document to be detected;
a word vectorization module for performing word-by-word vectorization on the sample document based on the massive corpus word vector library;
a document fingerprint calculation module for performing vectorization analysis on the sample document and the document to be detected, respectively, and extracting their fingerprint features;
and a document fingerprint similarity calculation module for comparing the fingerprint features of the document to be detected with those of the sample document and identifying sensitive documents to be detected.
10. The sensitive text recognition apparatus according to claim 9, further comprising:
a natural corpus pre-training module for training on the natural corpus to obtain the massive corpus word vector library.
CN202010012173.0A 2020-01-07 2020-01-07 Sensitive text recognition method and device based on natural semantics Pending CN111209373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010012173.0A CN111209373A (en) 2020-01-07 2020-01-07 Sensitive text recognition method and device based on natural semantics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010012173.0A CN111209373A (en) 2020-01-07 2020-01-07 Sensitive text recognition method and device based on natural semantics

Publications (1)

Publication Number Publication Date
CN111209373A true CN111209373A (en) 2020-05-29

Family

ID=70788651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010012173.0A Pending CN111209373A (en) 2020-01-07 2020-01-07 Sensitive text recognition method and device based on natural semantics

Country Status (1)

Country Link
CN (1) CN111209373A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143322A1 (en) * 2005-12-15 2007-06-21 International Business Machines Corporation Document comparision using multiple similarity measures
CN106484664A (en) * 2016-10-21 2017-03-08 竹间智能科技(上海)有限公司 Similarity calculating method between a kind of short text
CN106776534A (en) * 2016-11-11 2017-05-31 北京工商大学 The incremental learning method of term vector model
CN109101476A (en) * 2017-06-21 2018-12-28 阿里巴巴集团控股有限公司 A kind of term vector generates, data processing method and device
CN109344407A (en) * 2018-10-29 2019-02-15 北京天融信网络安全技术有限公司 Semantic-based document fingerprint construction method, storage medium and computer equipment
CN109766525A (en) * 2019-01-14 2019-05-17 湖南大学 A kind of sensitive information leakage detection framework of data-driven

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姜雪 (Jiang Xue): "基于simhash的文本相似检测算法研究" [Research on a simhash-based text similarity detection algorithm], pages 6-18 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116305285A (en) * 2023-03-30 2023-06-23 肇庆学院 Patient information desensitization processing method and system combining artificial intelligence
CN116305285B (en) * 2023-03-30 2024-04-05 肇庆学院 Patient information desensitization processing method and system combining artificial intelligence
CN117349407A (en) * 2023-12-04 2024-01-05 江苏君立华域信息安全技术股份有限公司 Automatic detection method and system for content security
CN117349407B (en) * 2023-12-04 2024-01-30 江苏君立华域信息安全技术股份有限公司 Automatic detection method and system for content security
CN117993018A (en) * 2024-03-29 2024-05-07 蚂蚁科技集团股份有限公司 Access method of third party large language model and gateway server

Similar Documents

Publication Publication Date Title
CN109977416B (en) Multi-level natural language anti-spam text method and system
CN110096570B (en) Intention identification method and device applied to intelligent customer service robot
CN108304372B (en) Entity extraction method and device, computer equipment and storage medium
US11544459B2 (en) Method and apparatus for determining feature words and server
CN107193796B (en) Public opinion event detection method and device
KR20110038474A (en) Apparatus and method for detecting sentence boundaries
CN111209373A (en) Sensitive text recognition method and device based on natural semantics
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN112287100A (en) Text recognition method, spelling error correction method and voice recognition method
CN112468659A (en) Quality evaluation method, device, equipment and storage medium applied to telephone customer service
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN114416979A (en) Text query method, text query equipment and storage medium
CN115759071A (en) Government affair sensitive information identification system and method based on big data
CN116150651A (en) AI-based depth synthesis detection method and system
CN113486174B (en) Model training, reading understanding method and device, electronic equipment and storage medium
Birla A robust unsupervised pattern discovery and clustering of speech signals
CN112784601A (en) Key information extraction method and device, electronic equipment and storage medium
CN110377753B (en) Relation extraction method and device based on relation trigger word and GRU model
CN112528653A (en) Short text entity identification method and system
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment
CN115630357B (en) Method for judging behavior of collecting personal information by application program crossing boundary
CN116305257A (en) Privacy information monitoring device and privacy information monitoring method
CN115983266A (en) Pinyin variant text identification method and system for checking credit investigation data of bank
CN116070642A (en) Text emotion analysis method and related device based on expression embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination