CN110399490A

CN110399490A - A kind of barrage file classification method, device, equipment and storage medium

Info

Publication number: CN110399490A
Application number: CN201910644651.7A
Authority: CN
Inventors: 王姣
Original assignee: Wuhan Douyu Network Technology Co Ltd
Current assignee: Wuhan Douyu Network Technology Co Ltd
Priority date: 2019-07-17
Filing date: 2019-07-17
Publication date: 2019-11-01

Abstract

The present invention proposes a kind of barrage file classification method, device, equipment and storage medium, which comprises obtains the uneven training dataset for having marked classification in advance, training dataset is divided into adequate sample and inadequate sample；The adequate sample is trained using textCNN model；Model training is carried out using SVM classifier to the inadequate sample；Text to be measured is carried out to input trained textCNN model, exports class probability of all categories in adequate sample；If the trained SVM classifier of text input to be measured is exported the classification of prediction less than the first preset threshold by the class probability of the output.The present invention is separately trained according to training sample size, obtain the disaggregated model for different text scales, then both disaggregated models are combined and is used for text classification to be measured, solves training sample data imbalance problem, compared to single model training, the risk of over-fitting can be reduced, biasing is reduced, recognition accuracy is higher.

Description

A kind of barrage file classification method, device, equipment and storage medium

Technical field

The invention belongs to big data technical fields, and in particular to a kind of barrage file classification method, device, equipment and storage Medium.

Background technique

On live streaming platform, Hei Chan clique can largely send advertisement barrage or pornographic barrage in direct broadcasting room, to reach brokenly Bad platform living broadcast environment, illegally seeks the purpose of illegitimate benefits, not only greatly destroys user experience, directly and indirectly The broken interests for having changed platform and normal users.The current interception that rubbish barrage may be implemented by training text model, but When training rubbish barrage textual classification model, it is frequently encountered the data of typical " sixteen distributions ", training sample distribution Unevenly, sample to be tested is always divided into more that of sample size by the problem for causing disaggregated model ineffective, such as model One kind, so that minority class sample can not be correctly validated.

The processing method of common unbalanced data is to Small Sample Database over-sampling and to carry out owing to adopt to big-sample data The operation of sample, the deficiency of this method are that over-sampling scale is difficult to hold, and will cause model training biasing.Either certain samples This leads to the recall rate decline of identification due to not having study to arrive.The model of common processing non-equilibrium data has random forest etc. Tree-model, but this class model needs a large amount of manual features to extract, and takes time and effort, does not also have neural network model classifying quality Significantly.

Summary of the invention

The present invention proposes a kind of barrage file classification method, device, equipment and storage medium, for solving live streaming barrage text The training sample distribution uneven problem for causing disaggregated model ineffective when this classification.

First aspect present invention proposes a kind of barrage file classification method, which comprises

S1, acquisition have marked the uneven training dataset of classification in advance, the sample size of each classification are counted, according to institute The training dataset is divided into adequate sample and inadequate sample by the sample size for stating each classification；

S2, the adequate sample is pre-processed, pretreated barrage text conversion is expressed at based on word vector Sentence matrix, pass through the sentence matrix training textCNN model；

S3, the inadequate sample is pre-processed, feature extraction and text representation is carried out by TF-IDF method, obtained To the set of eigenvectors of the inadequate sample, pass through described eigenvector collection training SVM classifier；

S4, by the trained textCNN model of text input to be measured, export class probability of all categories in adequate sample； If the class probability of the output is less than the first preset threshold, by the trained SVM classifier of text input to be measured, output The classification of prediction.

Optionally, after the step S4 further include:

The barrage ratio for each classification that S5, counting user are sent carries out user tag division and height according to the ratio The user that endangers identifies.

Optionally, in the step S1, if training data, which is concentrated, shares C classification, C >=4.Optionally, the step S1 In, method that the training dataset is divided into adequate sample and inadequate sample by the sample size according to each classification Include: in each classification sample size be more than that the second preset threshold belongs to adequate sample, be less than or equal to the described second default threshold Value belongs to inadequate sample；Or the distribution function F of fitting all categories sample size, the inflection point of F is sought, then is greater than described turn The sample of point quantity belongs to adequate sample, and the sample less than or equal to the inflection point quantity belongs to inadequate sample.

Optionally, the detailed process of the step S2 are as follows:

S21, pretreatment: the spcial character or other messy codes in barrage text are filtered out；

S22, vectorization indicate: utilizing CBOW model, be based on context Character prediction core character, obtain word vector, root Barrage text representation is formed a complete sentence submatrix according to the word vector；

S23, training textCNN model: the sentence matrix is subjected to feature extraction by one-dimensional convolutional layer, by pond The sentence of different length is become the expression of fixed length by layer, and the softmax function through full articulamentum exports the barrage text and belongs to The probability of each classification；

The probability that barrage text vector x belongs to a-th of classification is

Wherein w_a、w_bFor weight coefficient, y is output valve, and a, b=1,2 ..., M, M are that the classification in the adequate sample is total Number.

Optionally, in the step S3, it is described using TF-IDF method to the barrage text in the inadequate sample into Row feature extraction and text representation specifically:

S31, the frequency that occurs in the inadequate sample of statistics word, by the word in the inadequate sample according to The frequency descending sort chooses top n word as Feature Words, N >=1；

S32, the weight that Feature Words are calculated by TF-IFD, wherein TF is word frequency, its calculation formula is:

Wherein, TF_i,jIt is characterized the frequency of appearance of the word j in text i, n_i,jIt is characterized time that word j occurs in text i Number, k are different terms number in text i；

IDF is inverse document frequency, its calculation formula is:

Wherein, | D | it is the total sample number in the inadequate sample D,It indicates in inadequate sample comprising spy The text sum of word j is levied, j is some word after text word cutting, t_iFor the text comprising word j, d_jFor the inadequate sample Some text in D；

The corresponding weight w of Feature Words j_j=TF_i,j×IDF_i；

The feature vector of each barrage text in S33, the building inadequate sample, by each key in barrage text For word as a dimension in vector space, the value in dimension is the weight of corresponding keyword, obtains the inadequate sample Set of eigenvectors.

Second aspect of the present invention, provides a kind of barrage document sorting apparatus, and described device includes:

Sample division module: for defining barrage classification, the sample for each classification that training data is concentrated is divided into and is filled Sufficient sample and inadequate sample；

Separate training module: for being pre-processed to the adequate sample, by pretreated barrage text conversion at Based on the sentence matrix of word vector expression, pass through sentence matrix training textCNN model；The inadequate sample is carried out Pretreatment, carries out feature extraction and expression by TF-IDF method, carries out model training using SVM classifier；

Text classification module: each in adequate sample for exporting the trained textCNN model of text input to be sorted The class probability of classification；If the class probability of the output trains the text input to be measured less than the first preset threshold SVM classifier, export the classification of prediction；

Subscriber identification module: it for the barrage ratio for each classification that counting user is sent, is used according to the ratio Family label divides and high-risk user identification.

Optionally, the training module that separates specifically includes:

Adequate sample training unit: the spcial character or other messy codes in barrage text are filtered out；Utilize CBOW model, base In context Character prediction core character, word vector is obtained, according to the word vector by each barrage text representation at sentence square Battle array；According to sentence matrix training textCNN model；

Inadequate sample training unit: pre-processing text, counts the frequency that each word occurs in the inadequate sample Rate sorts the word in the inadequate sample according to the frequency, chooses top n word as Feature Words and composition characteristic word Allusion quotation, N >=1；The weight of selected Feature Words is calculated using TF-IDF method；Construct each barrage text in the inadequate sample Feature vector, using each keyword in barrage text as a dimension in vector space, the value in dimension is the key The weight of word obtains the set of eigenvectors of inadequate sample；According to described eigenvector collection training SVM classifier.

Third aspect present invention, provides a kind of computer equipment, including memory, processor and is stored in the storage In device and the computer program that can run on the processor, which is characterized in that the processor executes the computer journey The step of barrage file classification method described in first aspect present invention is realized when sequence.

Fourth aspect present invention provides a kind of computer readable storage medium, the computer-readable recording medium storage There is computer program, which is characterized in that bullet described in first aspect present invention is realized when the computer program is executed by processor The step of curtain file classification method.

Uneven training dataset is divided into adequate sample and inadequate sample and separates training by the present invention, is directed to Then both sub-models are combined to form a performance more by the disaggregated model of different text scales in some way Powerful classifier is used for text classification to be measured, efficiently solves live streaming barrage text training sample classification ratio data distribution The problem of unevenness is brought can reduce the risk of over-fitting compared to single model training, reduce biasing, recognition accuracy is more It is high.

Detailed description of the invention

It, below will be to needed in the technology of the present invention description in order to illustrate more clearly of technical solution of the present invention Attached drawing is briefly described, it should be apparent that, drawings in the following description are only some embodiments of the invention, for ability For the those of ordinary skill of domain, without any creative labor, it can also be obtained according to these attached drawings others Attached drawing.

Fig. 1 is the flow diagram of barrage file classification method provided in an embodiment of the present invention；

Fig. 2 is the general flow chart of model training provided in an embodiment of the present invention and text classification；

Fig. 3 is barrage document sorting apparatus structural schematic diagram provided in an embodiment of the present invention；

Fig. 4 is a kind of computer equipment structural schematic diagram provided in an embodiment of the present invention.

Specific embodiment

The present invention proposes a kind of barrage file classification method, device, equipment and storage medium, utilizes the side of model combination Method assembles existing disaggregated model by certain way, forms the more powerful classifier of performance, improves classification Quasi- curvature.

In order to make the invention's purpose, features and advantages of the invention more obvious and easy to understand, below in conjunction with the present invention Attached drawing in embodiment, technical scheme in the embodiment of the invention is clearly and completely described, it is clear that disclosed below Embodiment be only a part of the embodiment of the present invention, and not all embodiment.Based on the embodiments of the present invention, this field Those of ordinary skill's all other embodiment obtained without making creative work, belongs to protection of the present invention Range.

Class imbalance problem is a common problem, specific manifestation are as follows: data set during obtaining data set In the quantity of a certain classification sample and the quantity of remaining classification sample differ greatly.Such as on live streaming platform, the amount of normal barrage It is, if not going accordingly to improve data set or algorithm, directly to carry out classification based training far more than other types barrage, As a result, minority class sample data cannot adequately be paid attention to, even device can be classified under serious conditions and is neglected as noise Slightly, so as to cause the severe deviations of classification results.The application, which is used, is divided into adequate sample and not for uneven training dataset Adequate sample, and in such a way that adequate sample and inadequate sample are separately trained, to solve training number in barrage text classification According to collection imbalance problem, and then carry out barrage text classification.

Referring to Fig. 1, the present invention proposes a kind of barrage file classification method, which comprises

Barrage text collection is obtained, the mode manually marked can be used and marked each bullet previously according to barrage content of text The classification of curtain text forms uneven training dataset.If training data, which is concentrated, shares C classification, then C >=4 are so as to separated instruction Practice, for example can be used the barrage classification on live streaming platform at present: normal barrage, the black barrage in region, sprayer barrage, pornographic barrage, Advertisement barrage relates to seven classes such as political affairs barrage, other types barrage as class label.

In general, the amount of normal barrage be far more than other types barrage, therefore can by training data concentrate sample This is divided into adequate sample and inadequate sample according to the sample size of each classification, i.e., using most classes as adequate sample, will lack It is several classes of to be used as inadequate sample.Specific division methods can there are two, first method is rule of thumb, by sample in each classification Quantity is more than that the classification of the second preset threshold belongs to adequate sample, for example normal barrage, sprayer barrage, sample size are less than or equal to The classification of second preset threshold is inadequate sample, for example relate to political affairs barrage etc., second preset threshold can be set as 1000 Item.Second method is the distribution function F for being fitted all categories sample size, and abscissa is classification, and ordinate is sample size, The inflection point of F, such as the ordinate value q=1200 of inflection point are asked, then the sample greater than q belongs to adequate sample, the sample less than or equal to q Data belong to inadequate sample.

Further, the detailed process of the step S2 are as follows:

S21, pretreatment: the spcial character or other messy codes in barrage text are filtered out；Specifically, pretreatment further includes pair Barrage text such as is segmented, removes stop words at the processing.

S22, vectorization indicate: utilizing CBOW model, be based on context Character prediction core character, obtain word vector, root Barrage text representation is formed a complete sentence submatrix according to the word vector；Specifically, being based on context Character prediction core using CBOW model Heart character obtains the vector of single character, and contextual window size is 10, and the word vector dimension trained is 200, by barrage table It is shown as sentence matrix, every row is word vector, and dimension 200, this can analogize to the original image vegetarian refreshments in image.

Specifically, the corpus of vectorization passes through the one-dimensional convolutional layer of filter_size=(2,4,6,8), lead to respectively here It crosses four sliding windows of different sizes while carrying out the extraction of feature, the feature of extraction can be with bit-wise addition, can also be one Spliced in dimension, realizes feature extraction and fusion abundant.The second layer is pooling layers of max, such different length Sentence is most followed by the softmax connected entirely layer, exports each class by that can become illustrating for fixed length after pooling layers Other probability.Softmax is in more assorting processes, it is assumed that and we have array a V, Vi to indicate i-th of element in V, that The softmax value of this element is exactly

The probability that barrage text vector x belongs to a-th of classification is

The classification problem of extensive text can be solved by training the textCNN model come above.

S3, the inadequate sample is pre-processed, feature extraction and text representation is carried out by TF-IDF method, obtained To the set of eigenvectors of the inadequate sample, pass through described eigenvector collection training；SVM classifier is specifically, pretreated Method is identical as step S21's.

Further, in the step S3, it is described using TF-IDF method to the barrage text in the inadequate sample Carry out feature extraction and text representation specifically:

Specifically, the selection of Feature Words is Statistics-Based Method, the frequency that statistics word occurs in all training texts The frequency of rate, appearance is bigger, illustrates that word usage is higher, considers in conjunction with memory is calculated, and sorts according to the frequency of word, the value of N Set according to demand, for example, before optional selection N=400000 word as Feature Words.

IDF is inverse document frequency, its calculation formula is:

The corresponding weight w of Feature Words j_j=TF_i,j×IDF_i；

Specifically, the thought of reference Bayes, has done following processing to IDF is calculated, has occurred the document of word in order to prevent It is not present, denominator becomes 0, adds 1 after the sum of document is unified in when calculating.Due to being converted using log, add after computation One minimum is to add 0.00001 here, and when preventing from calculating the weight of word, TF value is multiplied with IDF value becomes 0, influences term vector Expression.

The feature vector of each barrage text in S33, the building inadequate sample, by each key in barrage text For word as a dimension in vector space, the value in dimension is the weight of corresponding keyword, obtains the inadequate sample Set of eigenvectors.The vector expression of each barrage text in the inadequate sample is completed with this.

After completing term weight function building, SVM algorithm is selected, trains probability of each Feature Words in each classification, SVM is the distance for calculating vector and reaching hyperplane, and this distance can indicate the probability of barrage generic, than general text This disaggregated model such as Bayes will be applicable in this scene.The SVM classifier is surpassed by calculating the feature vector, X of barrage text and reaching The distance d of plane classifies:

Wherein, X=(x₁, x₂..., x_n) be barrage text feature vector, x_jThe vector for being characterized word j is expressed, j=1, 2 ..., n, n are the word number of the barrage text, weight vectors W=(w₁, w₂..., w_n), W^TFor the transposition of W, w_jIt is characterized j pairs of word The weight answered, b are constant, are determined according to Feature Words respective weights；The distance d is the general of reflection barrage text X generic Rate.

Specifically, text to be measured, that is, barrage data, text to be measured also passes through pretreatment and the step S22 of step S21 Vectorization indicate after input trained textCNN model again.First preset threshold can be set as threshold= 0.3, threshold value is obtained according to inductive statistics, when textCNN model text classification Probability p < 0.3, can be existed and practical class The case where other deviation increases is inputted the model of SVM classifier for the sample to be tested for being lower than threshold value, exports final prediction Classification.It is greater than the sample of first preset threshold, for the class probability of textCNN model output with textCNN model Subject to classification.

Further, after the step S4 further include:

The barrage ratio for each classification that S5, counting user are sent carries out user's identification according to the ratio and label is drawn Point.

Specifically, the control to barrage is utilized on the behavior label for may map to user on live streaming platform The method of statistics sets the third threshold value of each classification, if the barrage for a certain classification that user sends is more than the third of the category Threshold value then sticks corresponding behavior label for user, for example the accounting of normal barrage that user sends reaches 100%, then can It is a high-quality user to define this user, black 60% or more the barrage accounting in region that user sends, this kind of user can determine Justice is sprayer user, and pornographic user's accounting that user sends is more than 20%, and this kind of user can be defined as high-risk user, utilize system The strategy of meter can be tagged to the user for sending out barrage, these labels can be not only used for the behavior limitation of user.It can be with Label is based further on to analyze the payment of user, viewing duration.

Referring to Fig. 2, the general flow chart of model training and text classification of the invention, in model training stage, the present invention Training dataset is divided into adequate sample and inadequate sample and separates training, obtains the classification mould for different text scales Type.Then in the text classification stage, it is stronger that both disaggregated models are combined to form into a performance in some way Big classifier is used for text classification to be measured.The present invention completes the classification of rubbish barrage text using the method for model combination, Advantage is that the risk of over-fitting can be reduced.With the different data of single model training, the biasing of final mask will increase, and The present invention assembles existing disaggregated model by certain way, can form the more powerful classifier of performance, i.e., The method that Weak Classifier is assembled into strong classifier.In addition the successive problem of model is based on basic assumption, and one unknown Text, from statistical angle, it is bigger for belonging to the probability of normal sample, so the combined sequence of model is first sufficient sample The model of inadequate sample training after the model of this training, i.e. SVM classifier after elder generation's textCNN model.

Referring to Fig. 3, the present invention provides a kind of barrage document sorting apparatus, described device includes:

Sample division module 310: the uneven training dataset for having marked classification in advance is obtained, the sample of each classification is counted The training dataset is divided into adequate sample and inadequate sample according to the sample size of each classification by this quantity；

Separate training module 320: for pre-processing to the adequate sample, by pretreated barrage text conversion At the sentence matrix expressed based on word vector, pass through sentence matrix training textCNN model；To the inadequate sample into Row pretreatment carries out feature extraction and expression by TF-IDF method, obtains the set of eigenvectors of inadequate sample；According to described Set of eigenvectors trains SVM classifier；

Text classification module 330: for exporting adequate sample for the trained textCNN model of text input to be sorted In class probability of all categories；If the class probability of the output instructs the text input to be measured less than the first preset threshold The SVM classifier perfected exports the classification of prediction；

Subscriber identification module 340: it for the barrage ratio for each classification that counting user is sent, is carried out according to the ratio User's identification and label divide.

Further, the training module 320 that separates specifically includes:

Adequate sample training unit 3201: the spcial character or other messy codes in barrage text are filtered out；Utilize CBOW mould Type is based on context Character prediction core character, obtains word vector, each barrage text representation is formed a complete sentence according to the word vector Submatrix；According to sentence matrix training textCNN model；

Inadequate sample training unit 3202: pre-processing text, counts each word in the inadequate sample and occurs Frequency choose top n word as Feature Words by the word in the inadequate sample according to the frequency descending sort, N >= 1；The weight of selected Feature Words is calculated using TF-IDF method；Construct the feature of each barrage text in the inadequate sample to Amount, using each keyword in barrage text as a dimension in vector space, the value in dimension is the power of the keyword Weight, obtains the set of eigenvectors of inadequate sample；According to described eigenvector collection training SVM classifier.

Refer to Fig. 4, a kind of computer equipment structural schematic diagram provided by the invention, comprising: memory 410, processor 420 and system bus 430, the memory 410 includes the program 4101 run of storage thereon, those skilled in the art It is appreciated that terminal device structure shown in Fig. 4 does not constitute the restriction to terminal device, may include than illustrate it is more or Less component perhaps combines certain components or different component layouts.

It is specifically introduced below with reference to each component parts of the Fig. 4 to terminal device:

Memory 410 can be used for storing software program and module, and processor 420 is stored in memory 410 by operation Software program and module, thereby executing the various function application and data processing of terminal.Memory 410 can mainly include Storing program area and storage data area, wherein storing program area can application journey needed for storage program area, at least one function Sequence (such as sound-playing function, image player function etc.) etc.；Storage data area can be stored to be created according to using for terminal Data (such as audio data, phone directory etc.) etc..It, can be in addition, memory 410 may include high-speed random access memory Including nonvolatile memory, for example, at least a disk memory, flush memory device or other volatile solid-states Part.

Program 4101 is run comprising barrage file classification method on memory 410, it is described to run program 4101 One or more module/units can be divided into, one or more of module/units are stored in the memory 410 In, and executed by processor 420, with the transmitting of completion notice and obtain notice realization process, one or more of modules/mono- Member can be the series of computation machine program instruction section that can complete specific function, and the instruction segment is for describing the computer journey Implementation procedure of the sequence 4101 in the equipment described in Fig. 4.For example, the computer program 4101, which can be divided into sample, divides mould Block, separately training module, text classification module, subscriber identification module etc..

Processor 420 is the control centre of terminal device, utilizes each of various interfaces and the entire terminal device of connection A part by running or execute the software program and/or module that are stored in memory 410, and calls and is stored in storage Data in device 410 execute the various functions and processing data of terminal, to carry out integral monitoring to terminal.Optionally, it handles Device 420 may include one or more processing units；Preferably, processor 420 can integrate application processor and modulation /demodulation processing Device, wherein the main processing operation system of application processor, application program etc., modem processor mainly handles wireless communication. It is understood that above-mentioned modem processor can not also be integrated into processor 420.

System bus 430 is for connection to each functional component of computer-internal, can with data information, address information, Information is controlled, type can be such as pci bus, isa bus, VESA bus.The instruction of processor 420 is passed by bus It is handed to memory 410,410 feedback data of memory is responsible for processor 420 and memory to processor 420, system bus 430 Data, instruction interaction between 410.Certain system bus 430 can also access other equipment, such as network interface, display are set It is standby etc..

The terminal device should include at least CPU, chipset, memory, disk system etc., other component parts are herein no longer It repeats.

In embodiments of the present invention, what processor 420 included by the terminal executed runs program specifically:

S4, text to be measured is carried out to input trained textCNN model, it is general exports classification of all categories in adequate sample Rate；If the class probability of the output is less than the first preset threshold, by the trained SVM classifier of text input to be measured, Export the classification of prediction；

The present invention is utilized respectively different models and removes training sample according to the size of training sample, and deep learning is come It says, the more classifying qualities of sample are better, and in the case that sample size is few, effect is not as good as traditional machine learning, so originally Patent on last decision logic, and passes through combination finally by the model come is trained using the thought divided and rule Method, judge text belong to which kind of probability it is higher, to realize text classification, be further used for user's identification.

It is apparent to those skilled in the art that for convenience and simplicity of description, the system of foregoing description, The specific work process of device and unit, can refer to corresponding processes in the foregoing method embodiment, and details are not described herein.

In the above-described embodiments, it all emphasizes particularly on different fields to the description of each embodiment, is not described in detail or remembers in some embodiment The part of load may refer to the associated description of other embodiments.

Those of ordinary skill in the art may be aware that each embodiment described in conjunction with the examples disclosed in this document Module, unit and/or method and step can be realized with the combination of electronic hardware or computer software and electronic hardware.This A little functions are implemented in hardware or software actually, the specific application and design constraint depending on technical solution.Specially Industry technical staff can use different methods to achieve the described function each specific application, but this realization is not It is considered as beyond the scope of this invention.

In several embodiments provided herein, it should be understood that disclosed system, device and method can be with It realizes by another way.For example, the apparatus embodiments described above are merely exemplary, for example, the unit It divides, only a kind of logical function partition, there may be another division manner in actual implementation, such as multiple units or components It can be combined or can be integrated into another system, or some features can be ignored or not executed.Another point, it is shown or The mutual coupling, direct-coupling or communication connection discussed can be through some interfaces, the indirect coupling of device or unit It closes or communicates to connect, can be electrical property, mechanical or other forms.

The unit as illustrated by the separation member may or may not be physically separated, aobvious as unit The component shown may or may not be physical unit, it can and it is in one place, or may be distributed over multiple In network unit.It can select some or all of unit therein according to the actual needs to realize the mesh of this embodiment scheme 's.

It, can also be in addition, the functional units in various embodiments of the present invention may be integrated into one processing unit It is that each unit physically exists alone, can also be integrated in one unit with two or more units.Above-mentioned integrated list Member both can take the form of hardware realization, can also realize in the form of software functional units.

If the integrated unit is realized in the form of SFU software functional unit and sells or use as independent product When, it can store in a computer readable storage medium.Based on this understanding, the present invention realizes above-described embodiment side All or part of the process in method can also instruct relevant hardware to complete, the computer by computer program Program can be stored in a computer readable storage medium, and the computer program is when being executed by processor, it can be achieved that above-mentioned each The step of a embodiment of the method.Wherein, the computer program includes computer program code, and the computer program code can Think source code form, object identification code form, executable file or certain intermediate forms etc..The computer-readable medium can be with It include: any entity or device, recording medium, USB flash disk, mobile hard disk, magnetic disk, light that can carry the computer program code Disk, computer storage, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electric carrier signal, telecommunication signal and software distribution medium etc..It should be noted that the computer The content that readable medium includes can carry out increase and decrease appropriate according to the requirement made laws in jurisdiction with patent practice, such as It does not include electric carrier signal and telecommunication signal according to legislation and patent practice, computer-readable medium in certain jurisdictions.

The above, the above embodiments are merely illustrative of the technical solutions of the present invention, rather than its limitations, although referring to before Stating embodiment, invention is explained in detail, those skilled in the art should understand that: it still can be to preceding Technical solution documented by each embodiment is stated to modify or equivalent replacement of some of the technical features；And these It modifies or replaces, the spirit and scope for technical solution of various embodiments of the present invention that it does not separate the essence of the corresponding technical solution.

Claims

1. a kind of barrage file classification method, which is characterized in that the described method includes:

S1, acquisition have marked the uneven training dataset of classification in advance, count the sample size of each classification, according to described each The training dataset is divided into adequate sample and inadequate sample by the sample size of a classification；

S2, the adequate sample is pre-processed, by pretreated barrage text conversion at the sentence expressed based on word vector Submatrix passes through sentence matrix training textCNN model；

S3, the inadequate sample is pre-processed, feature extraction and text representation is carried out by TF-IDF method, obtain institute The set of eigenvectors for stating inadequate sample passes through described eigenvector collection training SVM classifier；

S4, text to be measured is carried out to input trained textCNN model, exports class probability of all categories in adequate sample； If the class probability of the output carries out the text to be measured less than the first preset threshold, using trained SVM classifier Classification.

2. barrage file classification method according to claim 1, which is characterized in that after the step S4 further include:

The barrage ratio for each classification that S5, counting user are sent carries out user tag according to the ratio and divides and user's knowledge Not.

3. barrage file classification method according to claim 1, which is characterized in that in the step S1, if training dataset In share C classification, C >=4.

4. barrage file classification method according to claim 1, which is characterized in that described according to each in the step S1 The method that the training dataset is divided into adequate sample and inadequate sample includes: by the sample size of classification

Sample size is more than that the second preset threshold belongs to adequate sample in each classification, is less than or equal to second preset threshold Belong to inadequate sample；Or the distribution function F of fitting all categories sample size, the inflection point of F is sought, then sample size is greater than The sample of the knee value belongs to adequate sample, and the sample less than or equal to the knee value belongs to inadequate sample.

5. barrage file classification method according to claim 1, which is characterized in that the detailed process of the step S2 are as follows:

S22, vectorization indicate: utilizing CBOW model, be based on context Character prediction core character, word vector obtained, according to institute It states word vector barrage text representation forms a complete sentence submatrix；

S23, training textCNN model: carrying out feature extraction by one-dimensional convolutional layer for the sentence matrix, will by pond layer The sentence of different length becomes the expression of fixed length, and the softmax function output barrage text through full articulamentum belongs to each The probability of classification；The probability that barrage text vector x belongs to a-th of classification is

Wherein w_a、w_bFor weight coefficient, y is output valve, and a, b=1,2 ..., M, M are the classification sum in the adequate sample.

6. barrage file classification method according to claim 1, which is characterized in that described to use TF- in the step S3 IDF method carries out feature extraction and text representation to the barrage text in the inadequate sample specifically:

The frequency that S31, each word of statistics occur in the inadequate sample, by the word in the inadequate sample according to The frequency descending sort chooses top n word as Feature Words, N >=1；

Wherein, TF_i,jIt is characterized the frequency of appearance of the word j in text i, n_i,jIt is characterized the number that word j occurs in text i, k For different terms number in text i；

IDF is inverse document frequency, its calculation formula is:

Wherein, | D | it is the total sample number in the inadequate sample D,It indicates in inadequate sample comprising Feature Words j Text sum, j be text word cutting after some word, t_iFor the text comprising word j, d_jFor in the inadequate sample D Some text；

The corresponding weight w of Feature Words j_j=TF_i,j×IDF_i；

The feature vector of each barrage text, each keyword in barrage text is made in S33, the building inadequate sample For a dimension in vector space, the value in dimension is the weight of corresponding keyword, obtains the spy of the inadequate sample Levy vector set.

7. a kind of barrage document sorting apparatus, which is characterized in that described device includes:

Sample division module: for obtaining the uneven training dataset for having marked classification in advance, the sample of each classification is counted The training dataset is divided into adequate sample and inadequate sample according to the sample size of each classification by quantity；

Separate training module: for pre-processing to the adequate sample, by pretreated barrage text conversion at being based on The sentence matrix of word vector expression passes through sentence matrix training textCNN model；The inadequate sample is located in advance Reason carries out feature extraction and expression by TF-IDF method, obtains the set of eigenvectors of the inadequate sample, pass through the spy Levy vector set training SVM classifier；

Text classification module: for exporting the trained textCNN model of text input to be sorted of all categories in adequate sample Class probability；If the class probability of the output is less than the first preset threshold, by the trained SVM of text input to be measured Classifier exports the classification of prediction；

Subscriber identification module: for the barrage ratio for each classification that counting user is sent, user's mark is carried out according to the ratio Label divide and user's identification.

8. barrage document sorting apparatus according to claim 7, which is characterized in that the training module that separates specifically includes:

Adequate sample training unit: the spcial character or other messy codes in barrage text are filtered out；Using CBOW model, based on upper Hereafter Character prediction core character obtains word vector, and each barrage text representation is formed a complete sentence submatrix according to the word vector；Root According to sentence matrix training textCNN model；

Inadequate sample training unit: pre-processing text, counts the frequency that each word occurs in the inadequate sample, will Word in the inadequate sample chooses top n word as Feature Words, N >=1 according to the frequency descending sort；Use TF- The weight of IDF method calculating Feature Words；The feature vector for constructing each barrage text in the inadequate sample, by barrage text In each keyword as a dimension in vector space, the value in dimension is the weight of the keyword, is obtained inadequate The set of eigenvectors of sample；According to described eigenvector collection training SVM classifier.

9. a kind of computer equipment, including memory, processor and storage are in the memory and can be in the processor The computer program of upper operation, which is characterized in that the processor realized when executing the computer program as claim 1 to Described in any one of 6 the step of barrage file classification method.

10. a kind of computer readable storage medium, the computer-readable recording medium storage has computer program, and feature exists In realization barrage file classification method as described in any one of claims 1 to 6 when the computer program is executed by processor The step of.