CN112612892A - Special field corpus model construction method, computer equipment and storage medium - Google Patents

Special field corpus model construction method, computer equipment and storage medium

Info

Publication number
CN112612892A
CN112612892A (application CN202011589591.2A)
Authority
CN
China
Prior art keywords
corpus
model
frequency
words
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011589591.2A
Other languages
Chinese (zh)
Other versions
CN112612892B (en)
Inventor
顾嘉晟
岳小龙
高翔
纪达麒
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Daguan Data Chengdu Co ltd
Original Assignee
Daguan Data Chengdu Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Daguan Data Chengdu Co ltd
Priority to CN202011589591.2A
Publication of CN112612892A
Application granted
Publication of CN112612892B
Legal status: Active
Anticipated expiration

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06F: Electric digital data processing
    • G06F16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; classification
    • G06F16/353: Classification into predefined classes
    • G06F16/355: Class or cluster creation or modification
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/216: Parsing using statistical methods
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a domain-specific corpus model, computer equipment and a storage medium, wherein the method comprises the following steps: step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning; step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic; step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear; step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model. With the domain-specific corpus model generated from the data-enhanced special corpus, the invention can significantly improve the accuracy, recall and F1 score of classification tasks. The method can also greatly shorten the language model pre-training process and greatly reduce the resources it consumes.

Description

Special field corpus model construction method, computer equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a domain-specific corpus model construction method, computer equipment and a storage medium.
Background
Enterprises' daily operations involve a large amount of text processing, the document types are highly varied, and each type has relatively fixed formats, specifications, fixed collocations and the like. Daily document processing therefore presents many application scenarios for natural language processing on text, such as word segmentation, document format and type classification, text sentiment analysis, key information extraction, contract document review and document similarity calculation.
Currently, in both academia and industry, most NLP tasks rely on pre-trained language models such as n-gram models, BERT, GPT and their variants. The idea of pre-training is that the parameters of a deep neural network are not initialized randomly; they are first learned through a language modeling task, and downstream NLP tasks are then completed via transfer learning.
In practice, however, domain-specific document processing runs into a problem: the general corpora used for pre-training do not contain enough of the relevant language patterns, such as industry-specific terminology and the fixed grammatical collocations of particular document types, so downstream tasks such as key information extraction cannot accurately hit the complete key information. In text classification or key information extraction tasks in special fields such as finance, language models generated from general corpora tend to show some bias in semantic understanding.
As another example, in a text word segmentation task, preset keywords can only be handled crudely by adding a keyword dictionary.
Disclosure of Invention
To remedy the shortcomings of general language models on document NLP tasks in specific industry fields, the invention provides a domain-specific corpus model construction method, computer equipment and a storage medium. The domain-specific corpus model generated by the invention can serve multiple document types and multiple NLP tasks, and can shorten the fine-tuning time of downstream models, thereby reducing server resource consumption.
The technical scheme of the invention is as follows:
a method for constructing a domain-specific corpus model comprises the following steps:
step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning;
step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic, removing common words using the inverse document frequency component of TF-IDF, and taking the higher-frequency words among those remaining as high-frequency words of the current text or of the domain-specific corpus;
step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear, the enhancement method being: copying the paragraph containing the high-frequency word and inserting the copy at a random position in the pure unsupervised corpus;
step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model; once the domain-specific corpus model is trained, the corpus is re-segmented with the word segmentation model it generates, and training is iterated again to improve the language model.
Further, in step one, the data cleaning includes parsing and extracting the text in a large number of PDF files, the parsing including:
keeping the text content continuous and dividing it by paragraph, so that the context within a paragraph is coherent;
treating the document title as a separate paragraph, and each section title in the body as a separate paragraph, to keep consecutive sentences continuous.
Further, for content laid out in two or more columns, if the directly read content is semantically coherent it is kept as pure corpus; otherwise it is discarded.
Further, the parsing also includes:
normalizing the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
removing author information, table-of-contents content, pictures, charts, tables, headers and footers.
Further, in step two, the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are finally filtered out by the product of the two, retaining the words that are important to each file.
Further, in step three an enhancement amplitude is set; if the pure unsupervised corpus obtained in step one is small, the amplitude is set so that paragraphs containing high-frequency words are copied 3-5 times.
Further, in step four, context words are used to predict the next word, and the domain-specific corpus model is pre-trained in both the forward and backward directions.
Further, for a given text sequence, each token is predicted from its preceding or following context, and the probabilities of all time steps are multiplied together to form the model's objective function.
A computer device comprises a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the steps of the above domain-specific corpus model construction method.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above domain-specific corpus model construction method.
The invention has the beneficial effects that:
1. In text classification or key information extraction tasks in special fields, language models generated from general corpora show some bias in semantic understanding; the domain-specific corpus model generated from the data-enhanced special corpus significantly improves the accuracy, recall and F1 score of classification tasks.
2. Language model pre-training usually consumes a large amount of graphics card (GPU) resources; because the domain-specific corpus is orders of magnitude smaller than a general corpus, the pre-training process is greatly shortened and its resource consumption greatly reduced.
3. Because the domain-specific corpus model generated by the invention understands domain-specific text better, downstream NLP tasks can reach higher accuracy on smaller training sets, reducing training cost.
Drawings
Fig. 1 is a flowchart of the domain-specific corpus model construction method according to embodiment 2 of the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, specific embodiments of the present invention will now be described. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment provides a domain-specific corpus model construction method comprising the following steps:
Step one, corpus collection and preprocessing
In many industries, such as finance, information disclosure requirements mean that large numbers of publicly released PDF files can be found on the web, of types including, but not limited to, bond prospectuses, IPO prospectuses, investment fund contracts and equity pledge agreements.
In this step, the text in this large body of PDF files must be parsed and extracted to obtain a sufficient pure unsupervised corpus. The specific parsing steps are as follows (a code sketch follows the list):
(1) keep the text content continuous and divide it by paragraph, so that the context within a paragraph is coherent;
(2) normalize the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
(3) treat the document title as an independent paragraph, and each section title in the body as an independent paragraph, to keep consecutive sentences continuous;
(4) remove author information, table-of-contents content, pictures, charts, tables, headers and footers;
(5) for double-column or multi-column content, if the directly read content is semantically coherent, keep it as pure corpus; otherwise discard it.
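The following is a minimal sketch of this preprocessing step. The patent names no libraries, so pdfminer.six (PDF text extraction) and OpenCC (traditional-to-simplified conversion) are illustrative choices, and the filtering shown stands in for only a fragment of rules (1)-(5):

```python
# Illustrative preprocessing sketch; library choices are assumptions,
# not part of the patent.
from pdfminer.high_level import extract_text  # pip install pdfminer.six
from opencc import OpenCC                     # pip install opencc-python-reimplemented

t2s = OpenCC("t2s")  # rule (2): convert traditional characters to simplified

def clean_paragraphs(pdf_path: str) -> list[str]:
    raw = extract_text(pdf_path)
    paragraphs = []
    for para in raw.split("\n\n"):    # rules (1) and (3): paragraph-level units
        para = t2s.convert(para).strip()
        # Stand-in for rule (4): real filters for author lines, TOC entries,
        # headers/footers and tables would be document-specific.
        if not para or para.isdigit():  # e.g. bare page numbers
            continue
        paragraphs.append(para)
    return paragraphs
```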
Step two, analyzing word frequency and inverse text frequency index
The domain-specific corpus is an order of magnitude smaller than a general corpus, so in steps two and three data enhancement is applied to the special sentence patterns, domain-specific terms and the like in the corpus, making the model more familiar with their usage.
Step one yields a sufficient pure unsupervised corpus through data cleaning. In this step, words of higher importance in that corpus are identified with the TF-IDF statistic: common words are removed using the inverse document frequency component of TF-IDF, and the higher-frequency words among those remaining are taken as the high-frequency words of the current text; these serve as the high-frequency words of the domain-specific corpus and receive special treatment.
Specifically, the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are filtered out by the product of the two, retaining the words that are important to each file, as in the sketch below.
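A minimal sketch of this TF-IDF step follows. The segmenter is an assumption (the patent names none; jieba is used purely for illustration), and k, the number of words kept per document, is a free parameter:

```python
import math
from collections import Counter

import jieba  # pip install jieba; illustrative Chinese word segmenter

def high_freq_words(docs: list[str], k: int = 20) -> list[list[str]]:
    """Return the top-k words of each document by TF * IDF."""
    tokenized = [list(jieba.cut(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for toks in tokenized for w in set(toks))
    results = []
    for toks in tokenized:
        tf = Counter(toks)
        # Common words get a low IDF, so the TF * IDF product filters them out.
        scores = {w: (tf[w] / len(toks)) * math.log(n_docs / (1 + df[w]))
                  for w in tf}
        results.append(sorted(scores, key=scores.get, reverse=True)[:k])
    return results
```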
Step three, data enhancement
The sentences containing the high-frequency words extracted in step two are enhanced. The enhancement method is: copy the paragraph containing the high-frequency word and insert the copy at a random position in the pure unsupervised corpus. An enhancement amplitude must be set in this step; if the pure unsupervised corpus obtained in step one is small, the amplitude is set so that high-frequency-word paragraphs are copied 3-5 times, which yields a better language model.
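The helper below is one straightforward reading of this augmentation rule; it is hypothetical, treats the corpus as a list of paragraphs, and defaults the amplitude to 3, the low end of the 3-5x range suggested above:

```python
import random

def augment(paragraphs: list[str], high_freq: set[str],
            amplitude: int = 3) -> list[str]:
    """Copy each paragraph containing a high-frequency word `amplitude`
    times, inserting each copy at a random position in the corpus."""
    out = list(paragraphs)
    for para in paragraphs:
        if any(w in para for w in high_freq):
            for _ in range(amplitude):
                out.insert(random.randrange(len(out) + 1), para)
    return out
```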
Step four, training the language model
Language model pre-training uses XLNet, an autoregressive pre-training method. In this step, the XLNet model is used to model the pure unsupervised corpus enhanced in step three, generating the domain-specific corpus model.
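A minimal pre-training sketch follows, assuming the HuggingFace transformers and datasets libraries; the patent prescribes no framework, so the file names (domain_spm.model, augmented_corpus.txt) and hyperparameters are illustrative assumptions. DataCollatorForPermutationLanguageModeling implements XLNet's permutation language modeling objective:

```python
from transformers import (
    DataCollatorForPermutationLanguageModeling,
    Trainer, TrainingArguments,
    XLNetConfig, XLNetLMHeadModel, XLNetTokenizer,
)
from datasets import load_dataset  # pip install datasets sentencepiece

# Hypothetical SentencePiece model trained on the augmented domain corpus.
tokenizer = XLNetTokenizer("domain_spm.model")
model = XLNetLMHeadModel(XLNetConfig(vocab_size=tokenizer.vocab_size))

dataset = load_dataset("text", data_files={"train": "augmented_corpus.txt"})
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True,
                         padding="max_length", max_length=512),
    batched=True, remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments("xlnet-domain", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"],
    # Masks token spans and predicts them autoregressively under a
    # random factorization order (the even max_length matters: this
    # collator requires even sequence lengths).
    data_collator=DataCollatorForPermutationLanguageModeling(tokenizer=tokenizer),
).train()
```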
XLNet's autoregressive (AR) language model uses context words to predict the next word, pre-training the language model in both the forward and backward directions. XLNet also adopts an attention-mask mechanism, randomly masking out some words inside the Transformer, and adds the Transformer-XL mechanism, which lets XLNet address the discrepancy between pre-training and fine-tuning that arises when modeling long text. Once language model training is complete, the corpus can be segmented better using the word segmentation model generated from the model, which improves the accuracy of TF-IDF; training is then iterated again to improve the language model.
In XLNet's autoregressive pre-training, the forward pass predicts the current word from the words that precede it in the sentence, and the backward pass predicts it from the words that follow. For a given text sequence, the AR model predicts each token from its preceding or following context, and the probabilities of all time steps are finally multiplied to form the model's objective function, written out below. On top of the Transformer-XL mechanism, XLNet also introduces random factorization orders and a two-stream attention mechanism, which reduces the training resources consumed when fine-tuning downstream NLP tasks.
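For reference, the forward factorization just described is the standard autoregressive objective (the backward direction conditions on the following tokens instead), and XLNet generalizes it by taking an expectation over random factorization orders; this notation follows the XLNet paper rather than the patent:

```latex
% Forward autoregressive objective: the product of per-step
% probabilities, maximized in log form.
\max_{\theta}\ \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid \mathbf{x}_{<t}\right)

% XLNet's permutation language modeling objective, where \mathcal{Z}_T
% is the set of all permutations of [1, ..., T]:
\max_{\theta}\ \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```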
Correspondingly, this embodiment provides a computer device comprising a memory and a processor, the memory storing a computer program; when the processor executes the computer program, it implements the steps of the above domain-specific corpus model construction method.
In addition, this embodiment provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above domain-specific corpus model construction method.
Example 2
This example is based on example 1:
assume that a media user has used a crawler to collect a large number of stock-related files and wants a classification model to determine which stock each document concerns and to analyze whether it reports a profit or a loss.
Correspondingly, this embodiment provides a domain-specific corpus model construction method, as shown in fig. 1, comprising the following steps:
step 1, parsing all files, extracting the plain text from the PDFs, and cleaning and preprocessing the resulting text;
step 2, performing term frequency and inverse document frequency analysis on the corpus with a TF-IDF statistical model to obtain financial-domain terms and the high-frequency words of specific texts;
step 3, locating the paragraphs containing the high-frequency words from step 2 and copying each of them to arbitrary positions in the text 2 times;
step 4, pre-training a language model using the corpus generated in step 3 as the input corpus for XLNet;
and step 5, if the accuracy of the language model from step 4 on the downstream task is not clearly better than that of the general-corpus language model, using the model generated in step 4 to fine-tune the word segmentation model, repeating the TF-IDF analysis of step 2 and the data enhancement, and completing iterative training to generate a new language model; this loop is sketched below.
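The iteration in step 5 can be pictured as the following driver loop, reusing the hypothetical helpers sketched in example 1; pretrain_xlnet, downstream_accuracy and refit_segmenter are assumed stand-ins for the training, evaluation and segmenter fine-tuning stages, not interfaces defined by the patent:

```python
def build_domain_model(paragraphs, baseline_acc, max_rounds=3):
    model = None
    for _ in range(max_rounds):
        per_doc = high_freq_words(paragraphs)             # step 2: TF-IDF
        hf = {w for doc in per_doc for w in doc}
        corpus = augment(paragraphs, hf, amplitude=2)     # step 3: copy 2x
        model = pretrain_xlnet(corpus)                    # step 4: XLNet
        if downstream_accuracy(model) > baseline_acc:     # step 5: compare
            break
        # Otherwise use the new model to refine segmentation, which
        # improves TF-IDF on the next round.
        refit_segmenter(model)
    return model
```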
The foregoing is illustrative of the preferred embodiments of this invention. The invention is not limited to the precise forms disclosed herein; various other combinations, modifications and environments falling within the scope of the concept disclosed here, whether described above or apparent to those skilled in the relevant art, may be resorted to, and modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for constructing a domain-specific corpus model, characterized by comprising the following steps:
step one, corpus collection and preprocessing: obtaining a sufficient pure unsupervised corpus through data cleaning;
step two, term frequency and inverse document frequency analysis: identifying words of higher importance in the pure unsupervised corpus with the TF-IDF statistic, removing common words using the inverse document frequency component of TF-IDF, and taking the higher-frequency words among those remaining as high-frequency words of the current text or of the domain-specific corpus;
step three, data enhancement: enhancing the sentences in which the high-frequency words extracted in step two appear, the enhancement method being: copying the paragraph containing the high-frequency word and inserting the copy at a random position in the pure unsupervised corpus;
step four, language model training: modeling the pure unsupervised corpus enhanced in step three with an XLNet model to generate the domain-specific corpus model; once the domain-specific corpus model is trained, the corpus is re-segmented with the word segmentation model it generates, and training is iterated again to improve the language model.
2. The domain-specific corpus model construction method according to claim 1, wherein in step one the data cleaning comprises parsing and extracting the text in a large number of PDF files, the parsing comprising:
keeping the text content continuous and dividing it by paragraph, so that the context within a paragraph is coherent;
treating the document title as a separate paragraph, and each section title in the body as a separate paragraph, to keep consecutive sentences continuous.
3. The method according to claim 2, wherein for content laid out in two or more columns, if the directly read content is semantically coherent it is kept as pure corpus, and otherwise it is discarded.
4. The method according to claim 2, wherein the parsing further comprises:
normalizing the text between traditional and simplified Chinese characters, converting all traditional characters to simplified ones;
removing author information, table-of-contents content, pictures, charts, tables, headers and footers.
5. The domain-specific corpus model construction method according to claim 1, wherein in step two the frequency of each word within the current text (its term frequency) is calculated, and each word's occurrence across all texts is used to derive its inverse document frequency; common words are finally filtered out by the product of the two, and the words important to each document are retained.
6. The method according to claim 1, wherein in step three an enhancement amplitude is set, and if the pure unsupervised corpus obtained in step one is small, the enhancement amplitude is set so that the paragraph containing each high-frequency word is copied 3-5 times.
7. The method according to claim 1, wherein in step four context words are used to predict the next word, and the domain-specific corpus model is pre-trained in both the forward and backward directions.
8. The method according to claim 7, wherein for a given text sequence each token is predicted from its preceding or following context, and the probabilities of all time steps are multiplied together to serve as the objective function of the model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor when executing the computer program implements the steps of the method according to any of claims 1-8.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202011589591.2A 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium Active CN112612892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011589591.2A CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011589591.2A CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112612892A 2021-04-06
CN112612892B CN112612892B (en) 2022-11-01

Family

ID=75248796

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011589591.2A Active CN112612892B (en) 2020-12-29 2020-12-29 Special field corpus model construction method, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112612892B (en)



Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10559009B1 (en) * 2013-03-15 2020-02-11 Semcasting, Inc. System and method for linking qualified audiences with relevant media advertising through IP media zones
CN106257441A (en) * 2016-06-30 2016-12-28 电子科技大学 A kind of training method of skip language model based on word frequency
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model
CN108417210A (en) * 2018-01-10 2018-08-17 苏州思必驰信息科技有限公司 A kind of word insertion language model training method, words recognition method and system
US20200042508A1 (en) * 2018-08-06 2020-02-06 Walmart Apollo, Llc Artificial intelligence system and method for auto-naming customer tree nodes in a data structure
CN109189925A (en) * 2018-08-16 2019-01-11 华南师范大学 Term vector model based on mutual information and based on the file classification method of CNN
CN109933804A (en) * 2019-03-27 2019-06-25 北京信息科技大学 Merge the keyword abstraction method of subject information and two-way LSTM
CN110096705A (en) * 2019-04-29 2019-08-06 扬州大学 A kind of unsupervised english sentence simplifies algorithm automatically
CN110705291A (en) * 2019-10-10 2020-01-17 青岛科技大学 Word segmentation method and system for documents in ideological and political education field based on unsupervised learning
CN110851176A (en) * 2019-10-22 2020-02-28 天津大学 Clone code detection method capable of automatically constructing and utilizing pseudo clone corpus
CN111027327A (en) * 2019-10-29 2020-04-17 平安科技(深圳)有限公司 Machine reading understanding method, device, storage medium and device
CN110956021A (en) * 2019-11-14 2020-04-03 微民保险代理有限公司 Original article generation method, device, system and server
CN111767741A (en) * 2020-06-30 2020-10-13 福建农林大学 Text emotion analysis method based on deep learning and TFIDF algorithm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG Yubo et al., "Chinese Word Vectors Based on the Related Concept Fields of HowNet (知网)", Journal of Chinese Information Processing (中文信息学报) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4141733A1 (en) * 2021-08-26 2023-03-01 Beijing Baidu Netcom Science And Technology Co. Ltd. Model training method and apparatus, electronic device, and storage medium
CN113657122A (en) * 2021-09-07 2021-11-16 内蒙古工业大学 Mongolian Chinese machine translation method of pseudo-parallel corpus fused with transfer learning
CN113657122B (en) * 2021-09-07 2023-12-15 内蒙古工业大学 Mongolian machine translation method of pseudo parallel corpus integrating transfer learning
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server

Also Published As

Publication number Publication date
CN112612892B (en) 2022-11-01

Similar Documents

Publication Publication Date Title
Albalawi et al. Using topic modeling methods for short-text data: A comparative analysis
US11379668B2 (en) Topic models with sentiment priors based on distributed representations
CN112612892B (en) Special field corpus model construction method, computer equipment and storage medium
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
Nagamanjula et al. A novel framework based on bi-objective optimization and LAN2FIS for Twitter sentiment analysis
Atzeni et al. Using frame-based resources for sentiment analysis within the financial domain
Chouigui et al. An arabic multi-source news corpus: experimenting on single-document extractive summarization
Singh et al. Youtube comments sentiment analysis
Gharavi et al. Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase
Nasim et al. Cluster analysis of urdu tweets
AL-Jumaili A hybrid method of linguistic and statistical features for Arabic sentiment analysis
Dwivedi et al. Sentiment analytics for crypto pre and post covid: topic modeling
Pirovani et al. Studying the adaptation of Portuguese NER for different textual genres
Haider et al. Corporate news classification and valence prediction: A supervised approach
Sarwar et al. Author verification of nahj al-balagha
Gao et al. Detecting comments showing risk for suicide in YouTube
Hamada et al. Sentimental text processing tool for Russian language based on machine learning algorithms
Elarnaoty et al. Machine learning implementations in arabic text classification
Vidyavihar Sentiment analysis in Marathi language
Yadlapalli et al. Advanced Twitter sentiment analysis using supervised techniques and minimalistic features
Elagamy et al. Text mining approach to analyse stock market movement
Goel A study of text mining techniques: Applications and Issues
Sarwar et al. AGI-P: A Gender Identification Framework for Authorship Analysis Using Customized Fine-Tuning of Multilingual Language Model
Umidjon UNLOCKING THE POWER OF NATURAL LANGUAGE PROCESSING (NLP) FOR TEXT ANALYSIS
CN117291192B (en) Government affair text semantic understanding analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant