CN110196910B - Corpus classification method and apparatus - Google Patents

Corpus classification method and apparatus

Info

Publication number
CN110196910B
Authority
CN
China
Prior art keywords
candidate
words
translation
vector
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910468030.8A
Other languages
Chinese (zh)
Other versions
CN110196910A (en)
Inventor
孙健
周桐
李涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Apas Technology Co ltd
Original Assignee
Zhuhai Tianyan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Tianyan Technology Co ltd filed Critical Zhuhai Tianyan Technology Co ltd
Priority to CN201910468030.8A priority Critical patent/CN110196910B/en
Publication of CN110196910A publication Critical patent/CN110196910A/en
Application granted granted Critical
Publication of CN110196910B publication Critical patent/CN110196910B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a corpus classification method and apparatus, belonging to the field of data analysis. The method comprises the following steps: extracting the text corpora of each predetermined category to obtain the feature words corresponding to the text corpora; translating the feature words into a target language, and forming, for each predetermined category, a translation vector from the obtained translations and the vector features corresponding to the feature words; extracting candidate words from a candidate corpus to form a candidate vector, matching the candidate vector against the translation vector of each predetermined category, and determining the target category to which the candidate corpus belongs according to the resulting matching degrees. By analyzing keywords in text corpora of a known language and matching the translated keywords against a candidate corpus in an unknown language, the method and apparatus predict the category of the unknown-language corpus, so that corpora can be classified even when no translator for the corresponding language is available, improving information processing efficiency.

Description

Corpus classification method and apparatus
Technical Field
The present application relates to the field of data processing, and in particular, to a method and an apparatus for classifying corpora of unknown languages.
Background
With the explosive growth of information on the internet, information now spreads across media in many countries. Most network data exists in text form, and how to classify this text information with natural language processing so that users can find useful information more accurately and quickly has become an important research problem in artificial intelligence. At present, corpora such as web pages and news are classified mainly with machine learning models: manually labeled samples of each category are constructed, a classification model is trained on them continuously, and the resulting model is then used to classify candidate corpora online.
However, in a multi-language environment, samples must be created for each language, so manual labeling has to be performed per language and training rules have to be built per language. When there are many target languages, the construction cost is very high, which greatly reduces information processing efficiency.
Disclosure of Invention
The embodiments of the present application aim to provide a corpus classification method and apparatus to meet the need to classify corpora in a multi-language environment.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
according to a first aspect of an embodiment of the present application, there is provided a method for corpus classification, the method including:
extracting text corpora of each set category respectively to obtain feature words corresponding to the text corpora;
translating the feature words according to target languages respectively, and forming translation vectors corresponding to all established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
extracting corresponding candidate words in the candidate corpus to form candidate vectors, and respectively matching the candidate vectors with the translation vectors corresponding to the established categories to obtain matching degrees of the candidate vectors and the translation vectors of each established category;
and determining the target category of the candidate corpus according to the matching degree.
In an embodiment of the present application, the method further includes:
extracting the weight probability corresponding to each feature word in the translation vector;
performing iterative training by taking the vector features corresponding to the feature words as sample features to obtain the language model;
and matching the language model as a translation vector corresponding to each set category with the candidate vector respectively.
In an embodiment of the present application, when the text corpora of each predetermined category are extracted respectively,
performing word segmentation on the text corpus, and counting keywords obtained after word segmentation;
searching similar words or associated words corresponding to the keywords respectively, and counting vector characteristics corresponding to the keywords;
and respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain characteristic words corresponding to the text corpus.
In one embodiment of the present application, when the translation vectors corresponding to each predetermined category are formed,
extracting the vector characteristics corresponding to the characteristic words to obtain the vector characteristics corresponding to the translation, and performing association combination on the translation and the characteristic words to form the translation vector.
In one embodiment of the present application, when the translation vectors corresponding to each predetermined category are formed,
and when a feature word has more than one translation in the target language, the weight in the vector features corresponding to the feature word is divided equally among the translations, and each translation is associated and combined with the feature word to form one of a plurality of groups of corresponding translation vectors.
In an embodiment of the present application, when extracting corresponding candidate words from the candidate corpus to form candidate vectors,
analyzing the candidate corpus, respectively extracting candidate words therein,
respectively extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to the characteristic attributes respectively to obtain vector characteristics corresponding to the candidate words respectively;
and fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
In an embodiment of the present application, when the candidate vectors are respectively matched with the translation vectors corresponding to each predetermined category,
extracting vector characteristics corresponding to the candidate words in the candidate vectors;
matching the vector characteristics corresponding to the candidate words with the vector characteristics corresponding to each translation vector;
screening out the predetermined categories whose matching degree is greater than a set threshold according to the obtained matching degrees;
and taking the screened predetermined categories as target categories to which the candidate corpus belongs.
According to a second aspect of an embodiment of the present application, there is provided an apparatus for corpus classification, the apparatus including:
the extraction module is used for respectively extracting the text corpora of each set category to obtain the characteristic words corresponding to the text corpora;
the translation module is used for translating the feature words according to target languages respectively and forming translation vectors corresponding to all the established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
the matching module is used for extracting corresponding candidate words in the candidate corpus to form candidate vectors, and matching the candidate vectors with the translation vectors corresponding to the established categories respectively to obtain the matching degrees of the candidate vectors and the translation vectors of each established category;
and the dividing module is used for determining the target category to which the candidate corpus belongs according to the matching degree.
In an embodiment of the present application, the apparatus further includes a model unit, which specifically includes:
the extracting unit is used for extracting the weight probability corresponding to each feature word in the translation vector;
the training unit is used for carrying out iterative training by taking the vector characteristics corresponding to the characteristic words as sample characteristics to obtain the language model;
and the matching unit is used for matching the language model as a translation vector corresponding to each set category with the candidate vector respectively.
In an embodiment of the present application, the extracting module specifically includes,
the word segmentation unit is used for segmenting words of the text corpus and counting keywords obtained after word segmentation;
the association unit is used for searching the similar meaning words or the associated words corresponding to the keywords respectively and counting the vector characteristics corresponding to the keywords;
and the screening unit is used for respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain the characteristic words corresponding to the text corpus.
In an embodiment of the present application, the translation module specifically includes:
and the association unit is used for extracting the vector characteristics corresponding to the characteristic words to obtain the vector characteristics corresponding to the translation, and forming the translation vector after the translation and the characteristic words are associated and combined.
In an embodiment of the present application, in the translation module, when the translation corresponding to the feature word in the target language is greater than one, each translation is associated and combined with the feature word, the weight in the vector feature corresponding to the feature word is equally divided, and each translation is associated and combined with the feature word, so as to form multiple sets of corresponding translation vectors.
In an embodiment of the present application, the matching module specifically includes:
the analysis unit is used for analyzing the candidate corpus and respectively extracting candidate words in the candidate corpus,
the weight distribution unit is used for respectively extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to the characteristic attributes to respectively obtain the vector characteristics corresponding to the candidate words;
and the fitting unit is used for fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
According to the technical solutions provided by the embodiments of the present application, the text corpora of each predetermined category are extracted to obtain their feature words; the feature words are translated into a target language, and translation vectors are formed for each predetermined category from the obtained translations and the vector features corresponding to the feature words; candidate words are extracted from the candidate corpus to form candidate vectors, which are matched against the translation vector of each predetermined category to obtain matching degrees; and the target category of the candidate corpus is determined according to the matching degrees. By analyzing keywords in the text corpora of a known language and matching the translated keywords against a candidate corpus in an unknown language, the scheme predicts the category to which the unknown-language corpus belongs, so that corpora can be classified even when no translator for the corresponding language is available, improving information processing efficiency.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some of the embodiments described in this specification, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram of a method of corpus classification according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a corpus classifying device according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.
The embodiment of the application provides a corpus classification method and device.
First, a corpus classification method provided in an embodiment of the present application is described below.
Most existing network data exists in text form, and the text often belongs to different languages. In the prior art, when corpora such as web pages and news are classified, preset manually labeled samples are trained continuously, which requires a large amount of manual processing, and candidate corpora are classified only after a classification model has been trained. In a multi-language environment, however, labeling samples for every language separately is often impractical and reduces development efficiency. The present invention trains translation vectors for each predetermined category by analyzing the keywords in text corpora of a known language, analyzes the candidate corpus, extracts candidate vectors and matches them against the translation vectors of each category, thereby determining the target category corresponding to the candidate corpus; corpora can thus be classified even when no translator for the corresponding language is available, improving the efficiency of information processing.
Fig. 1 is a flowchart of a corpus classification method according to an embodiment of the present application. As shown in fig. 1, the method may include the following steps:
In step 101, the text corpora of each predetermined category are extracted to obtain the feature words corresponding to each corpus.
In this embodiment, when the text corpora of each predetermined category are extracted,
step 101a, performing word segmentation on the text corpus, and counting keywords obtained after word segmentation;
After semantic analysis, the full text of the text corpus is segmented into words. The word frequency of each content word in the segmentation result, in particular nouns and verbs, is counted, together with the positions at which these words occur in the corpus. Content words whose word frequency exceeds a set threshold and/or which occur in key positions such as the title, the first paragraph or the last paragraph are taken as the keywords of the text corpus, as sketched below.
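By way of illustration only, a minimal Python sketch of this step follows; the jieba segmenter, the frequency threshold and the use of title membership to stand in for the position-based rules are assumptions, not part of the embodiment.

    from collections import Counter
    import jieba  # assumed segmenter; any tokenizer with the same interface would do

    def extract_keywords(corpus_text, title, min_freq=3):
        """Segment the corpus, then keep words that are frequent or appear in a key position (here: the title)."""
        words = jieba.lcut(corpus_text)
        freq = Counter(w for w in words if len(w) > 1)  # crude content-word filter
        title_words = set(jieba.lcut(title))
        return {w for w, count in freq.items() if count >= min_freq or w in title_words}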
Step 101b, searching corresponding similar words or associated words of the keywords respectively, and counting vector characteristics corresponding to the keywords;
Synonyms or near-synonyms of each keyword are looked up, and the vector features corresponding to the keywords are counted. In this embodiment, feature attributes of each keyword (including its synonyms and/or near-synonyms), such as word frequency, length, part of speech, position markers, whether it starts a sentence and whether it is in bold, are extracted and used as the vector features of the keyword.
And 101c, respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain characteristic words corresponding to the text corpus.
Because a text corpus contains too many keywords, the keywords must be screened by weight, and the feature words that represent the features of the predetermined category are extracted from them, so that the extracted feature words accurately represent the category features of that category.
Specifically, a weight is set for each vector attribute of a keyword, the weighted attributes are summed, and the keywords are screened according to the resulting weighted sums, as sketched below.
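A minimal sketch of this weighting-and-screening step; the attribute names, weights and top_k cutoff are hypothetical and chosen only for illustration.

    def screen_feature_words(keyword_attrs, attr_weights, top_k=20):
        """keyword_attrs: {keyword: {attribute: value}}; attr_weights: {attribute: weight}.
        Score each keyword by the weighted sum of its vector attributes and keep the top_k as feature words."""
        scores = {
            kw: sum(attr_weights.get(name, 0.0) * value for name, value in attrs.items())
            for kw, attrs in keyword_attrs.items()
        }
        return sorted(scores, key=scores.get, reverse=True)[:top_k]

    # Example with hypothetical attributes and weights:
    feature_words = screen_feature_words(
        {"股票": {"tf": 12, "in_title": 1, "bold": 0}, "上涨": {"tf": 7, "in_title": 0, "bold": 1}},
        {"tf": 0.1, "in_title": 1.0, "bold": 0.5},
        top_k=2,
    )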
Step 102: translating the feature words according to target languages respectively, and forming translation vectors corresponding to all established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language.
A preset translation lexicon is called, and each feature word is translated into the target language to obtain its translation. Feature attributes of the translation, such as word frequency, length, part of speech, position markers, whether it starts a sentence and whether it is in bold, are extracted and combined with the corresponding weights to obtain the vector features of the translation, which are then combined with the vector features of the untranslated feature word to form the translation vector.
In this embodiment, a translation vector has two parts: one part is the feature word taken from a text of the predetermined category, which represents the corresponding feature of that category; the other part is the translation of the feature word, which expresses that feature in the target language. Normally, in the text corpus of each predetermined category, the vector features of a feature word equal the vector features of its translation, because the translation is obtained directly from the feature word. If, however, a feature word has more than one translation, the feature word forms a translation vector with each translation separately: the weights in the vector features of the translations are divided equally and combined with the vector features of the feature word, forming several groups of corresponding translation vectors in which the weights are the averages of the keyword weight and the corresponding translation weight. A sketch of this construction follows.
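The sketch below shows only the equal split of a feature word's weight among multiple translations; translate_fn is an assumed lookup into the preset translation lexicon and is not defined by the patent.

    def build_translation_vector(feature_word_weights, translate_fn):
        """feature_word_weights: {feature_word: weight}. translate_fn(word) returns the list of
        target-language translations. When a feature word has several translations, its weight is
        split equally among them and each (word, translation) pair becomes one vector entry."""
        entries = []
        for word, weight in feature_word_weights.items():
            translations = translate_fn(word) or []
            share = weight / len(translations) if translations else 0.0
            for t in translations:
                entries.append({"feature_word": word, "translation": t, "weight": share})
        return entries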
In other embodiments, the vector features in the translation vectors and the vector features of the feature words are trained to obtain a language model, and this model is used in place of the translation vectors of each predetermined category. Specifically:
102a, extracting the weight probability corresponding to each feature word in the translation vector;
The feature words in each translation vector of each predetermined category and the vector features of the translations are extracted, and the weight parameters in the vector features are normalized into weight probabilities.
Step 102b, performing iterative training by taking the vector characteristics corresponding to the characteristic words as sample characteristics to obtain the language model;
After the vector features of the feature words in each predetermined category have likewise been normalized, a support vector machine (SVM) is trained on the vector features of the feature words and the vector features of the translation vectors of each predetermined category. The difference between the vector features of each feature word in the category's text corpus and the vector features of the translation vector is taken as a positive sample; the difference between the vector features of non-feature words in that corpus and the vector features of the translation vector is taken as a negative sample. Iterative training is performed on the text corpus of each predetermined category with these sample features to obtain a language model, which can then assign each translation in the language to a category and judge the probability that a translation belongs to a given predetermined category, as sketched below.
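A minimal sketch of this training step with scikit-learn, assuming the vector features have already been normalized into fixed-length numeric arrays:

    import numpy as np
    from sklearn.svm import SVC

    def train_language_model(translation_vec, feature_word_vecs, non_feature_word_vecs):
        """Positive samples: difference between each feature word's vector features and the
        translation vector. Negative samples: the same difference for non-feature words."""
        X = np.vstack([v - translation_vec for v in feature_word_vecs + non_feature_word_vecs])
        y = np.array([1] * len(feature_word_vecs) + [0] * len(non_feature_word_vecs))
        model = SVC(kernel="rbf", probability=True)  # probability=True so later steps can rank categories
        model.fit(X, y)
        return model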
And 102c, matching the language model as a translation vector corresponding to each set type with the candidate vector respectively.
In the subsequent steps, candidate words are extracted from the candidate corpus to form candidate vectors, the candidate vectors are matched against the language model to obtain the probability that they belong to each predetermined category, and the category with the highest probability is taken as the target category to which the candidate corpus belongs.
Step 103: extracting corresponding candidate words in the candidate corpus to form candidate vectors, and respectively matching the candidate vectors with the translation vectors corresponding to the established categories to obtain matching degrees of the candidate vectors and the translation vectors of each established category;
Converting the candidate corpus into a candidate vector comprises the following steps:
Step 103a, analyzing the candidate corpus and extracting the candidate words in it.
The candidate corpus in the target language is segmented into words and the segmentation result is screened: the word frequency of each content word, in particular nouns and verbs, and its positions in the corpus are counted, and content words whose word frequency exceeds a set threshold and/or which occur in key positions such as the title, the first paragraph or the last paragraph are taken as the candidate words of the candidate corpus.
Step 103b, respectively extracting the feature attributes corresponding to the candidate words and the weights corresponding to the feature attributes, respectively obtaining vector features corresponding to the candidate words;
In this embodiment, feature attributes of each candidate word, such as word frequency, length, part of speech, position markers, whether it starts a sentence and whether it is in bold, are extracted and associated with the corresponding weights to obtain the vector features of each candidate word.
And 103c, fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
In this embodiment, the vector features corresponding to the candidate words are normalized to form candidate vectors corresponding to the candidate corpus.
When the candidate vector is matched against the translation vector of each predetermined category, the vector features of the candidate words are extracted from the candidate vector and matched against the translation vector of each predetermined category. If the matching degree between the candidate vector and a translation vector is high, the two share similar feature words with similar feature attributes, that is, the probability that the candidate corpus and the text corpus of that predetermined category belong to the same category is high. Conversely, if the matching degree is low, the candidate vector and the feature words of that translation vector differ greatly, and the probability that the two corpora belong to the same category is small. A matching sketch is given below.
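The patent does not fix a particular matching-degree measure; the sketch below assumes cosine similarity between the candidate vector and each category's translation vector as one plausible choice.

    import numpy as np

    def match_candidate(candidate_vec, translation_vecs, threshold=0.6):
        """translation_vecs: {category: numeric vector}. Returns the categories whose matching
        degree (here: cosine similarity) with the candidate vector exceeds the threshold."""
        def cosine(a, b):
            a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        degrees = {cat: cosine(candidate_vec, vec) for cat, vec in translation_vecs.items()}
        return {cat: d for cat, d in degrees.items() if d > threshold}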
In other embodiments, the trained language model is called, the candidate vector corresponding to the candidate corpus is matched in the language model, and the correlation score between the candidate vector and each translation vector is judged for selection in the subsequent step.
And 104, determining the target category of the candidate corpus according to the matching degree.
The predetermined categories whose matching degree exceeds the set threshold are taken as target categories, and the candidate corpus belongs to these target categories.
In other embodiments, if the matching degree between the candidate vector and the translation vectors of more than one predetermined category exceeds the set threshold, the candidate corpus belongs to more than one predetermined category.
By analyzing the keywords in the text corpora of a known language and matching the translated keywords against a candidate corpus in an unknown language, the above scheme predicts the category to which the unknown-language corpus belongs, so that corpora can be classified even when no translator for the corresponding language is available, improving information processing efficiency.
In another alternative embodiment, the corpus is classified as follows:
Step 201: extracting text corpora of each set category respectively to obtain feature words corresponding to the text corpora;
Each content word, together with its synonyms, prefix-related words, common-substring-related words and semantically related words, is obtained from the document set corresponding to the text corpora of the predetermined category, and the resulting set is denoted S.
step 202: translating the feature words according to target languages respectively, and forming translation vectors corresponding to all established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
In this embodiment, the set S is translated into the target language, and the set of the resulting translations is denoted D. Keywords in the set S that match the text corpora of the predetermined category are extracted as feature words and are trained with a topic model or word embeddings to generate the feature attributes of the feature words; the feature attributes of the translations corresponding to the feature words are generated in the same way.
When the weight factor for the feature attributes of a feature word is calculated, the TF/IDF value, the page aggregation value and the semantic aggregation value are combined:
1) TF/IDF: the TF/IDF value of the feature word is calculated and denoted g1;
TF/IDF is the product of the word frequency of the feature word and its inverse document frequency, where TF is the word frequency of the feature word and IDF is the reciprocal of the number of documents in the document set in which the feature word appears.
2) Page aggregation value: with a sentence-level sliding window of size M (a positive integer), the number of other feature words within the window is counted and denoted g2;
3) Semantic aggregation value: within a neighborhood N of the vector space, the number of other feature words is counted and denoted g3.
The weight of the feature word is then G = a1 × g1 + a2 × g2 + a3 × g3, where a1, a2 and a3 are given coefficients.
Similarly, when the weight factor for the feature attributes of the translation of a feature word is calculated, the TF/IDF value, the page aggregation value and the semantic aggregation value are combined:
1) TF/IDF: the TF/IDF value of the translation of the feature word is calculated and denoted h1;
TF/IDF is, as above, the product of word frequency and inverse document frequency.
2) Page aggregation value: with a sentence-level sliding window of size M, the number of translations of other feature words within the window is counted and denoted h2;
3) Semantic aggregation value: within a neighborhood N of the vector space, the number of translations of other feature words is counted and denoted h3.
The weight of the translation of the feature word is then H = b1 × h1 + b2 × h2 + b3 × h3, where b1, b2 and b3 are given coefficients.
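Both weights are straightforward weighted sums; a sketch with placeholder coefficients (the patent only states that a1..a3 and b1..b3 are given) follows.

    def feature_word_weight(g1, g2, g3, a1=0.5, a2=0.3, a3=0.2):
        """G = a1*g1 + a2*g2 + a3*g3: TF/IDF, page aggregation value, semantic aggregation value."""
        return a1 * g1 + a2 * g2 + a3 * g3

    def translation_weight(h1, h2, h3, b1=0.5, b2=0.3, b3=0.2):
        """H = b1*h1 + b2*h2 + b3*h3, the analogous weight for the feature word's translation."""
        return b1 * h1 + b2 * h2 + b3 * h3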
Thus, in summary, the translation vector corresponding to the text corpus of the predetermined category is computed as:
[formula image GDA0003396178420000091]
where V_doc is the translation vector corresponding to a text corpus doc of a given category, n is the number of feature words in the text corpus doc, i = 1, 2, 3, …, n, V_wi is the feature attribute corresponding to feature word w_i and its translation, G_wi is the weight corresponding to feature word w_i, and H_wi is the weight of the translation corresponding to feature word w_i.
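The exact combination is given only in the formula image GDA0003396178420000091 and cannot be recovered from the text; the sketch below therefore assumes one plausible reading, in which each feature-attribute vector V_wi is scaled by the average of G_wi and H_wi and the scaled vectors are summed.

    import numpy as np

    def doc_translation_vector(V, G, H):
        """V: list of feature-attribute vectors V_wi; G, H: lists of weights G_wi, H_wi.
        Assumed aggregation (not confirmed by the patent text): V_doc = sum_i ((G_wi + H_wi) / 2) * V_wi."""
        return sum(((g + h) / 2.0) * np.asarray(v, dtype=float) for v, g, h in zip(V, G, H))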
Further, a certain number of content words (verbs, nouns and the like) are selected from the text corpus of the predetermined category as tag words and the feature vector of the positive sample is calculated; a certain number of other words (function words, interjections and the like) are selected as non-tag words and the feature vector of the negative sample is calculated. The positive-sample and negative-sample feature vectors are each subtracted from the translation vector and normalized, and a regression model is trained on the results, as sketched below.
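A minimal sketch of this regression training, assuming logistic regression (the embodiment does not name a specific regression model) and simple L2 normalization of the difference vectors:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def train_regression_model(translation_vec, tag_word_vecs, non_tag_word_vecs):
        """Tag words (content words) give positive samples, non-tag words give negative ones;
        each sample is the normalized difference between the word's feature vector and the translation vector."""
        def normalize(v):
            v = np.asarray(v, dtype=float) - translation_vec
            return v / (np.linalg.norm(v) + 1e-12)
        X = np.vstack([normalize(v) for v in tag_word_vecs + non_tag_word_vecs])
        y = np.array([1] * len(tag_word_vecs) + [0] * len(non_tag_word_vecs))
        return LogisticRegression(max_iter=1000).fit(X, y)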
Step 203: extracting corresponding candidate words in the candidate corpus to form candidate vectors, and respectively matching the candidate vectors with the translation vectors corresponding to the established categories to obtain matching degrees of the candidate vectors and the translation vectors of each established category;
In this embodiment, the candidate corpus is segmented into words; the words in the segmentation result that match keywords in the set D obtained in step 202 are taken as candidate words, and the original source-language text of each candidate word before translation is also obtained so that candidate vectors can be formed subsequently.
The weight of each candidate word and of its original text in the candidate corpus is calculated to form the candidate vector corresponding to the candidate corpus.
The candidate vectors are evaluated against the regression model obtained in step 202 to obtain the matching degree between the candidate vector and the translation vector of each predetermined category.
Step 204: and determining the target category of the candidate corpus according to the matching degree.
The present invention trains the translation vectors of each predetermined category by analyzing the keywords in the text corpora of a known language, analyzes the candidate corpus, extracts candidate vectors and matches them against the translation vectors of each category, thereby determining the target category corresponding to the candidate corpus; corpora can thus be classified even when no translator for the corresponding language is available, improving the efficiency of information processing.
Fig. 2 is a schematic structural diagram of a corpus classification apparatus according to an embodiment of the present application. Referring to fig. 2, in a software implementation, the corpus classification apparatus 800 may include: an extraction module 801, a translation module 802, a matching module 803 and a dividing module 804, wherein,
an extraction module 801, configured to extract text corpora of each predetermined category, respectively, to obtain feature words corresponding to the text corpora;
a translation module 802, configured to translate the feature words according to target languages, and form translation vectors corresponding to each determined category according to the obtained translations and vector features corresponding to the feature words; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
a matching module 803, configured to extract candidate words corresponding to the candidate corpus to form candidate vectors, and match the candidate vectors with the translation vectors corresponding to the determined categories, respectively, to obtain matching degrees of the candidate vectors and the translation vectors of each determined category;
and the dividing module 804 is configured to determine a target category to which the candidate corpus belongs according to the matching degree.
The extraction module 801 includes:
the word segmentation unit is used for segmenting words of the text corpus and counting keywords obtained after word segmentation;
the association unit is used for searching the similar meaning words or the associated words corresponding to the keywords respectively and counting the vector characteristics corresponding to the keywords;
and the screening unit is used for respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain the characteristic words corresponding to the text corpus.
The corpus classification device 800 further includes a model unit, which specifically includes:
the extracting unit is used for extracting the weight probability corresponding to each feature word in the translation vector;
the training unit is used for carrying out iterative training by taking the vector characteristics corresponding to the characteristic words as sample characteristics to obtain the language model;
and the matching unit is used for matching the language model as a translation vector corresponding to each set category with the candidate vector respectively.
The translation module 802 specifically includes:
and the association unit is used for extracting the vector characteristics corresponding to the characteristic words to obtain the vector characteristics corresponding to the translation, and forming the translation vector after the translation and the characteristic words are associated and combined.
In the translation module 802, when the translation number of the feature word corresponding to the target language is greater than one, each translation is respectively associated and combined with the feature word, the weights in the vector features corresponding to the feature word are equally divided, and each translation is respectively associated and combined with the feature word to form a plurality of sets of corresponding translation vectors.
The matching module 803 specifically includes:
the analysis unit is used for analyzing the candidate corpus and respectively extracting candidate words in the candidate corpus,
the weight distribution unit is used for respectively extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to the characteristic attributes to respectively obtain the vector characteristics corresponding to the candidate words;
and the fitting unit is used for fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (9)

1. A method for corpus classification, the method comprising:
extracting the text corpora of each set category respectively to obtain the characteristic words corresponding to the text corpora, wherein the set categories represent the categories of the text information, and the categories of the text information are obtained by classifying the text information of the set languages;
translating the feature words according to target languages respectively, and forming translation vectors corresponding to all established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
extracting corresponding candidate words in the candidate corpus of the target language to form candidate vectors, and respectively matching the candidate vectors with the translation vectors corresponding to the established categories to obtain matching degrees of the candidate vectors and the translation vectors of each established category;
determining the target category to which the candidate corpus belongs according to the matching degree;
when the candidate vectors are respectively matched with the translation vectors corresponding to the established categories, the method further comprises the following steps:
extracting the weight probability corresponding to each characteristic word in the translation vector, wherein the translation vector corresponding to the text corpus doc of the set category is
[formula image FDA0003396178410000011]
where V_doc is the translation vector corresponding to a text corpus doc of a given category, n is the number of characteristic words in the text corpus doc, i = 1, 2, 3, …, n, V_wi is the feature attribute corresponding to characteristic word w_i and its translation, G_wi is the weight corresponding to characteristic word w_i, and H_wi is the weight of the translation corresponding to characteristic word w_i;
taking the difference value of the vector characteristics corresponding to each characteristic word in the text corpus under each set category and the vector characteristics corresponding to the translation vector as a positive sample; taking the difference value of the vector characteristics corresponding to the non-characteristic words in the text corpus under each set category and the vector characteristics corresponding to the translation vectors as a negative sample; respectively carrying out iterative training in each text corpus of a set type according to sample characteristics to obtain a language model;
and matching the language model as a translation vector corresponding to each set category with the candidate vector respectively.
2. The method of claim 1, comprising: when extracting from each text corpus of the predetermined category,
performing word segmentation on the text corpus, and counting keywords obtained after word segmentation;
searching similar words or associated words corresponding to the keywords respectively, and counting vector characteristics corresponding to the keywords;
and respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain characteristic words corresponding to the text corpus.
3. The method according to claim 1, wherein said composing the translation vector corresponding to each of the predetermined classes,
and when the translation corresponding to the feature words in the target language is more than one, respectively associating and combining each translation with the feature words, equally dividing the weight in the vector features corresponding to the feature words, and respectively associating and combining each translation with the feature words to form a plurality of groups of corresponding translation vectors.
4. The method according to claim 1, wherein when said extracting corresponding candidate words in said corpus of candidate words in said target language to form candidate vectors,
analyzing the candidate corpus, respectively extracting candidate words therein,
respectively extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to the characteristic attributes respectively to obtain vector characteristics corresponding to the candidate words respectively;
and fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
5. The method of claim 1, wherein when matching the candidate vectors with the translation vectors corresponding to each predetermined category,
extracting vector characteristics corresponding to the candidate words in the candidate vectors;
matching the vector characteristics corresponding to the candidate words with the vector characteristics corresponding to each translation vector;
screening out the set categories whose matching degree is greater than a set threshold according to the obtained matching degrees;
and taking the screened set categories as target categories to which the candidate corpus belongs.
6. An apparatus for corpus classification, the apparatus comprising:
the extraction module is used for respectively extracting the text corpora of each set category to obtain the characteristic words corresponding to the text corpora, wherein the set categories represent the categories of the text information, and the categories of the text information are obtained by classifying the text information of the set languages;
the translation module is used for translating the feature words according to target languages respectively and forming translation vectors corresponding to all the established categories according to the obtained translations and the vector characteristics corresponding to the feature words respectively; the translation vector is used for describing the characteristic attribute corresponding to the characteristic word under each set category in the target language;
the matching module is used for extracting corresponding candidate words in the candidate corpus of the target language to form candidate vectors, and matching the candidate vectors with the translation vectors corresponding to the established categories respectively to obtain the matching degrees of the candidate vectors and the translation vectors of each established category;
the dividing module is used for determining the target category to which the candidate corpus belongs according to the matching degree;
the device further comprises a model unit, and the model unit specifically comprises:
an extracting unit, configured to extract weight probabilities corresponding to the feature words in the translation vector, where the translation vector corresponding to the text corpus of the given category is
[formula image FDA0003396178410000031]
where V_doc is the translation vector corresponding to a text corpus doc of a given category, n is the number of characteristic words in the text corpus doc, i = 1, 2, 3, …, n, V_wi is the feature attribute corresponding to characteristic word w_i and its translation, G_wi is the weight corresponding to characteristic word w_i, and H_wi is the weight of the translation corresponding to characteristic word w_i;
the training unit is used for taking the difference value of the vector characteristics corresponding to each feature word in the text corpus under each set category and the vector characteristics corresponding to the translation vector as a positive sample; taking the difference value of the vector characteristics corresponding to the non-characteristic words in the text corpus under each set category and the vector characteristics corresponding to the translation vectors as a negative sample; respectively carrying out iterative training in each text corpus of a set type according to sample characteristics to obtain a language model;
and the matching unit is used for matching the language model as a translation vector corresponding to each set category with the candidate vector respectively.
7. The device according to claim 6, characterized in that the extraction module, in particular comprising,
the word segmentation unit is used for segmenting words of the text corpus and counting keywords obtained after word segmentation;
the association unit is used for searching the similar meaning words or the associated words corresponding to the keywords respectively and counting the vector characteristics corresponding to the keywords;
and the screening unit is used for respectively setting weights corresponding to the keywords according to the vector characteristics, and screening according to the weights to obtain the characteristic words corresponding to the text corpus.
8. The apparatus of claim 6, wherein: in the translation module, when the translation corresponding to the feature word in the target language is more than one, each translation is respectively associated and combined with the feature word, the weights in the vector features corresponding to the feature word are equally divided, and each translation is respectively associated and combined with the feature word to form a plurality of groups of corresponding translation vectors.
9. The apparatus according to claim 6, wherein the matching module specifically includes:
the analysis unit is used for analyzing the candidate corpus and respectively extracting candidate words in the candidate corpus,
the weight distribution unit is used for respectively extracting the characteristic attributes corresponding to the candidate words and the weights corresponding to the characteristic attributes to respectively obtain the vector characteristics corresponding to the candidate words;
and the fitting unit is used for fitting the vector characteristics corresponding to the candidate words to form candidate vectors corresponding to the candidate corpus.
CN201910468030.8A 2019-05-30 2019-05-30 Corpus classification method and apparatus Active CN110196910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910468030.8A CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910468030.8A CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Publications (2)

Publication Number Publication Date
CN110196910A CN110196910A (en) 2019-09-03
CN110196910B true CN110196910B (en) 2022-02-15

Family

ID=67753486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910468030.8A Active CN110196910B (en) 2019-05-30 2019-05-30 Corpus classification method and apparatus

Country Status (1)

Country Link
CN (1) CN110196910B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522927B (en) * 2020-04-15 2023-07-14 北京百度网讯科技有限公司 Entity query method and device based on knowledge graph
CN112307210A (en) * 2020-11-06 2021-02-02 中冶赛迪工程技术股份有限公司 Document tag prediction method, system, medium and electronic device
CN112417153B (en) * 2020-11-20 2023-07-04 虎博网络技术(上海)有限公司 Text classification method, apparatus, terminal device and readable storage medium
CN112836045A (en) * 2020-12-25 2021-05-25 中科恒运股份有限公司 Data processing method and device based on text data set and terminal equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011100862A1 (en) * 2010-02-22 2011-08-25 Yahoo! Inc. Bootstrapping text classifiers by language adaptation
CN103902619A (en) * 2012-12-28 2014-07-02 ***通信集团公司 Internet public opinion monitoring method and system
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667177B (en) * 2009-09-23 2011-10-26 清华大学 Method and device for aligning bilingual text
CN106649288B (en) * 2016-12-12 2020-06-23 北京百度网讯科技有限公司 Artificial intelligence based translation method and device
CN108460396B (en) * 2017-09-20 2021-10-15 腾讯科技(深圳)有限公司 Negative sampling method and device
CN108510977B (en) * 2018-03-21 2020-05-22 清华大学 Language identification method and computer equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011100862A1 (en) * 2010-02-22 2011-08-25 Yahoo! Inc. Bootstrapping text classifiers by language adaptation
CN103902619A (en) * 2012-12-28 2014-07-02 ***通信集团公司 Internet public opinion monitoring method and system
CN108536756A (en) * 2018-03-16 2018-09-14 苏州大学 Mood sorting technique and system based on bilingual information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Bilingual Event Extraction Methods; Zhu Zhu; China Excellent Master's Theses Full-text Database, Information Science and Technology; 20170215; pp. I138-4481 *

Also Published As

Publication number Publication date
CN110196910A (en) 2019-09-03

Similar Documents

Publication Publication Date Title
CN110196910B (en) Corpus classification method and apparatus
CN108287858B (en) Semantic extraction method and device for natural language
CN107291723B (en) Method and device for classifying webpage texts and method and device for identifying webpage texts
Rao et al. Classifying latent user attributes in twitter
CN109460455B (en) Text detection method and device
TWI536181B (en) Language identification in multilingual text
CN105760363B (en) Word sense disambiguation method and device for text file
CN110162778B (en) Text abstract generation method and device
CN112364624B (en) Keyword extraction method based on deep learning language model fusion semantic features
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN110287784B (en) Annual report text structure identification method
CN110287314A (en) Long text credibility evaluation method and system based on Unsupervised clustering
Patel et al. Dynamic lexicon generation for natural scene images
CN112069312A (en) Text classification method based on entity recognition and electronic device
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
CN108345694B (en) Document retrieval method and system based on theme database
Hamid et al. Bengali slang detection using state-of-the-art supervised models from a given text
CN111133429A (en) Extracting expressions for natural language processing
CN112765977A (en) Word segmentation method and device based on cross-language data enhancement
CN112183093A (en) Enterprise public opinion analysis method, device, equipment and readable storage medium
JP6426074B2 (en) Related document search device, model creation device, method and program thereof
CN109918661B (en) Synonym acquisition method and device
CN108427769B (en) Character interest tag extraction method based on social network
CN111159405A (en) Irony detection method based on background knowledge
CN110955845A (en) User interest identification method and device, and search result processing method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220728

Address after: No.16 and 17, unit 1, North District, Kailin center, No.51 Jinshui East Road, Zhengzhou area (Zhengdong), Henan pilot Free Trade Zone, Zhengzhou City, Henan Province, 450000

Patentee after: Zhengzhou Apas Technology Co.,Ltd.

Address before: E301-27, building 1, No.1, hagongda Road, Tangjiawan Town, Zhuhai City, Guangdong Province

Patentee before: ZHUHAI TIANYAN TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right