CN108959265A - Cross-domain text sentiment classification method, device, computer equipment and storage medium - Google Patents

Cross-domain text sentiment classification method, device, computer equipment and storage medium

Info

Publication number
CN108959265A
Authority
CN
China
Prior art keywords
training set
training
feature
classifier
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810770172.5A
Other languages
Chinese (zh)
Inventor
秦兴德
刘奕慧
郭玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Dingfeng Cattle Technology Co Ltd
Original Assignee
Shenzhen Dingfeng Cattle Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Dingfeng Cattle Technology Co Ltd
Priority to CN201810770172.5A
Publication of CN108959265A publication Critical patent/CN108959265A/en
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide a cross-domain text sentiment classification method, device, computer equipment and storage medium. The method includes: merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool; dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to obtain multiple feature training sets; training one classifier on each feature training set; building an integrated classifier from all trained classifiers; and applying the integrated classifier to the unlabeled training set in the target domain. This solves the prior-art problems that labeled samples must be acquired manually and that each domain needs its own labeled samples, and greatly shortens processing time and model construction time. In addition, the invention is simple to implement, fast and accurate, and can be deployed at large scale.

Description

Cross-domain text sentiment classification method, device, computer equipment and storage medium
Technical field
The present invention relates to the technical field of sentiment classification, and more particularly to a cross-domain text sentiment classification method, device, computer equipment and storage medium.
Background art
Sentiment classification is one of the main tasks of natural language processing. Existing methods mostly focus on sentiment classification within a single domain. Commonly used methods include machine learning classification based on the vector space model, machine learning classification based on word-vector models, and deep learning methods such as RNNs (Recurrent Neural Networks) and CNNs (Convolutional Neural Networks).
When performing sentiment classification with the above methods, a large number of labeled samples is required. Labeled samples are mainly acquired manually, and each domain needs its own labeled samples, which is time-consuming, laborious and clearly disadvantageous.
Summary of the invention
Embodiments of the invention provide a cross-domain text sentiment classification method, device, computer equipment and storage medium, intended to solve the problems that, when performing sentiment classification, labeled samples must be acquired manually and each domain needs its own labeled samples.
In a first aspect, an embodiment of the invention provides a cross-domain text sentiment classification method, comprising:
merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
training one classifier on each feature training set according to the text vector and sentiment label of each sample;
building an integrated classifier from all trained classifiers; and
applying the integrated classifier to the unlabeled training set in the target domain.
In a further technical solution, dividing the first training set into multiple sub-training sets comprises:
dividing the first training set evenly into multiple sub-training sets.
In a further technical solution, training one classifier on each feature training set comprises:
initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
In a further technical solution, building an integrated classifier from the classifiers trained on the feature training sets comprises:
determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
In a further technical solution, before merging the first training set and the second training set into a third training set and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, the method further comprises:
performing word segmentation on the first training set and the second training set; and
removing the stop words from the first training set and the second training set.
In a second aspect, an embodiment of the invention further provides a cross-domain text sentiment classification device, comprising:
an acquiring unit for merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
a combining unit for dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
a training unit for training one classifier on each feature training set;
an establishing unit for building an integrated classifier from the classifiers trained on the feature training sets; and
an applying unit for applying the integrated classifier to the unlabeled training set in the target domain.
In a further technical solution, the training unit comprises:
an initialization unit for initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
an iterative training unit for performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
a first determination unit for determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
In a further technical solution, the establishing unit comprises:
a second determination unit for determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
In a third aspect, an embodiment of the invention further provides a computer equipment comprising a memory and a processor, the memory storing a computer program, and the processor implementing the above method when executing the computer program.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, implement the above method.
With the technical solution of the embodiments of the invention, the labeled data set in the source domain and a small labeled data set in the target domain can be used to label the large unlabeled data set in the target domain, thereby solving the prior-art problems that labeled samples must be acquired manually and that each domain needs its own labeled samples, and greatly shortening processing time and model construction time. In the method of the invention, multiple classifiers are combined into an integrated classifier, which avoids overfitting of a single classifier and improves generalization ability. In addition, the invention is simple to implement, fast and accurate, and can be deployed at large scale.
Description of the drawings
To illustrate the technical solutions in the embodiments of the invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow chart of a cross-domain text sentiment classification method provided by an embodiment of the invention;
Fig. 2 is a schematic sub-flow chart of a cross-domain text sentiment classification method provided by an embodiment of the invention;
Fig. 3 is a schematic flow chart of a cross-domain text sentiment classification method provided by another embodiment of the invention;
Fig. 4 is a schematic block diagram of a cross-domain text sentiment classification device provided by an embodiment of the invention;
Fig. 5 is a schematic block diagram of the combining unit of a cross-domain text sentiment classification device provided by an embodiment of the invention;
Fig. 6 is a schematic block diagram of the training unit of a cross-domain text sentiment classification device provided by an embodiment of the invention;
Fig. 7 is a schematic block diagram of the establishing unit of a cross-domain text sentiment classification device provided by an embodiment of the invention;
Fig. 8 is a schematic block diagram of a cross-domain text sentiment classification device provided by another embodiment of the invention; and
Fig. 9 is a schematic block diagram of a computer equipment provided by an embodiment of the invention.
Specific embodiment
The technical solutions in the embodiments of the invention will be described clearly and completely below with reference to the drawings in the embodiments of the invention. Obviously, the described embodiments are some, rather than all, of the embodiments of the invention. Based on the embodiments of the invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the invention.
It should be understood that the terms "include" and "comprise", when used in this specification and the appended claims, indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or sets thereof.
It should also be understood that the terminology used in this description of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description of the invention and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in the description of the invention and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
Fig. 1 is a schematic flow chart of a cross-domain text sentiment classification method provided by an embodiment of the invention. As shown, the method includes the following steps S1-S5.
S1: merge a first training set and a second training set into a third training set, and perform word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set. The first training set is a labeled training set in a source domain; the second training set is a labeled training set in a target domain, which further contains an unlabeled training set.
In embodiments of the invention, the first training set is the labeled training set in the source domain, the second training set is the labeled training set in the target domain, and the target domain further contains an unlabeled training set. The purpose of the embodiments of the invention is to use the labeled training set in the source domain and the small labeled training set in the target domain to label the large unlabeled training set in the target domain.
In a specific implementation, the first training set and the second training set are merged into the third training set, and word-vector training is performed on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set. Merging the first training set with the second training set and then performing word-vector training on the third training set creates an association between the first training set and the second training set, so that the labeling result is more accurate.
In one embodiment, the word-vector tool used is word2vec. Word2vec is an efficient tool for representing words as real-valued vectors. Drawing on ideas from deep learning, it reduces the processing of text content, through training, to vector operations in a K-dimensional vector space, and similarity in that vector space can be used to represent semantic similarity of the text.
It should be noted that those skilled in the art may also use other kinds of word-vector tools to perform word-vector training on the third training set; the invention is not specifically limited in this respect.
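Purely as an illustration, a minimal sketch of step S1 might use the gensim library (an assumption; the patent does not name a concrete implementation). Here first_set and second_set are hypothetical lists of (tokens, label) pairs, and each sample's text vector is taken as the mean of its word vectors, which is one simple choice among many:

```python
# Minimal sketch of step S1, assuming gensim's word2vec and tokenized inputs.
from gensim.models import Word2Vec
import numpy as np

def train_word_vectors(first_set, second_set, dim=100):
    third_set = first_set + second_set                 # merge into the third training set
    sentences = [tokens for tokens, _ in third_set]
    model = Word2Vec(sentences, vector_size=dim, min_count=1)  # word-vector training
    # Represent each sample's text vector as the mean of its word vectors.
    vectors = np.array([np.mean([model.wv[w] for w in tokens], axis=0)
                        for tokens, _ in third_set])
    labels = np.array([label for _, label in third_set])       # sentiment labels
    return vectors, labels, model
```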
S2: divide the first training set into multiple sub-training sets, and merge each sub-training set with the second training set to correspondingly obtain multiple feature training sets.
In a specific implementation, the first training set is divided into multiple sub-training sets, and each sub-training set is merged with the second training set, correspondingly obtaining multiple feature training sets. For example, in one embodiment the first training set is A and the second training set is B. The first training set A is divided into sub-training sets A1, A2 and A3, and each of A1, A2 and A3 is merged with the second training set B, correspondingly obtaining three feature training sets.
In one embodiment, the first training set is divided evenly into multiple sub-training sets, i.e. each sub-training set contains the same number of samples.
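A minimal sketch of this splitting-and-merging step, assuming the NumPy arrays produced by the previous sketch, might be:

```python
# Sketch of step S2: split the source-domain set evenly, then merge each
# sub-training set with the labeled target-domain set.
import numpy as np

def build_feature_training_sets(X_src, y_src, X_tgt, y_tgt, n_splits=3):
    feature_sets = []
    for X_sub, y_sub in zip(np.array_split(X_src, n_splits),
                            np.array_split(y_src, n_splits)):
        X_feat = np.concatenate([X_sub, X_tgt])   # merge with the second training set
        y_feat = np.concatenate([y_sub, y_tgt])
        feature_sets.append((X_feat, y_feat))
    return feature_sets                           # one feature training set per classifier
```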
S3: train one classifier on each feature training set according to the text vector and sentiment label of each sample.
In a specific implementation, one classifier is trained on each feature training set according to the text vectors and sentiment labels of its samples obtained in step S1 above.
In one embodiment, the specific training process includes the following steps S310-S330.
S310: initialize the weight of each training sample in the feature training set.
In a specific implementation, the initial weight $w_i^1$ of each training sample in the feature training set is set first, where i is the index of a training sample in the feature training set and j is the index of the feature training set.
In one embodiment, the initial weight of each training sample in the feature training set is set to $w_i^1 = 1/M$, where M is the number of training samples in the feature training set.
S320: perform multiple rounds of iterative training on the feature training set.
In a specific implementation, iterative training is performed on the feature training set, where each iteration includes the following steps 1-4.
Step 1: train a sub-classifier $G_t$ on the feature training set to obtain the classification function $G_t(x)$.
Step 2: compute the classification error rate of the sub-classifier according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$.
Step 3: compute the weight of the sub-classifier according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$.
Step 4: obtain the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, with the normalization factor of formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$.
The meaning of each symbol in the above formulas is as follows: x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration. It should be noted that the weight used in the first iteration is the initial weight $w_i^1$.
S330: determine the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$.
In a specific implementation, $L_j(x)$ is the classifier obtained by training on feature training set j. By training each of the multiple feature training sets with steps S310-S330, multiple classifiers $L_j(x)$ are correspondingly obtained.
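The iteration above is the standard AdaBoost update. A minimal sketch, under the assumption that scikit-learn decision stumps serve as the sub-classifiers (the patent does not prescribe a sub-classifier type), could be:

```python
# Sketch of steps S310-S330: boosted training on one feature training set.
# Sentiment labels y are assumed to be +1 / -1.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_boosted_classifier(X, y, T=10):
    M = len(X)
    w = np.full(M, 1.0 / M)                      # S310: initial weights w_i = 1/M
    stumps, alphas = [], []
    for t in range(T):
        g = DecisionTreeClassifier(max_depth=1)  # sub-classifier G_t
        g.fit(X, y, sample_weight=w)
        pred = g.predict(X)
        e = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # formula (1)
        alpha = 0.5 * np.log((1 - e) / e)        # formula (2): sub-classifier weight
        w = w * np.exp(-alpha * y * pred)        # formula (3): sample-weight update
        w = w / w.sum()                          # formula (4): normalization
        stumps.append(g)
        alphas.append(alpha)

    def L(X_new):                                # S330, formula (5)
        scores = sum(a * g.predict(X_new) for a, g in zip(alphas, stumps))
        return np.sign(scores)
    return L
```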
S4: build an integrated classifier from all trained classifiers.
In a specific implementation, an integrated classifier is built from all trained classifiers.
In one embodiment, the integrated classifier L(x) is determined according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$.
Here m is the number of feature training sets, j is the index of a feature training set, and $L_j(x)$ is the classifier obtained by training on feature training set j.
Combining multiple classifiers $L_j(x)$ into the integrated classifier L(x) avoids overfitting of a single classifier and improves generalization ability.
S5: apply the integrated classifier to the unlabeled training set in the target domain.
In a specific implementation, the integrated classifier is applied to the unlabeled training set in the target domain, which yields a sentiment classification label for each sample of the unlabeled training set in the target domain.
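A sketch of steps S4-S5 that builds on the previous function, again only illustrating a majority-vote reading of the formula:

```python
# Sketch of steps S4-S5: majority-vote ensemble applied to unlabeled target data.
import numpy as np

def build_and_apply_ensemble(feature_sets, X_target_unlabeled, T=10):
    classifiers = [train_boosted_classifier(X, y, T) for X, y in feature_sets]
    votes = sum(L(X_target_unlabeled) for L in classifiers)  # sum of L_j(x)
    return np.sign(votes)      # integrated classifier L(x): labels in {+1, -1}
```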
With the technical solution of the embodiments of the invention, the labeled data set in the source domain and a small labeled data set in the target domain can be used to label the large unlabeled data set in the target domain, thereby solving the prior-art problems that labeled samples must be acquired manually and that each domain needs its own labeled samples, and greatly shortening processing time and model construction time. In the method of the invention, multiple classifiers are combined into an integrated classifier, which avoids overfitting of a single classifier and improves generalization ability. In addition, the invention is simple to implement, fast and accurate, and can be deployed at large scale.
Fig. 3 is a schematic flow chart of a cross-domain text sentiment classification method provided by another embodiment of the invention. As shown in Fig. 3, the cross-domain text sentiment classification method of this embodiment includes steps S31-S37, where steps S33-S37 are similar to steps S1-S5 in the above embodiment and are not repeated here. The steps S31-S32 added in this embodiment are described in detail below.
S31: perform word segmentation on the first training set and the second training set.
Word segmentation is a basic step of text processing, i.e. extracting words from the first training set and the second training set as samples. In a specific implementation, a word segmentation tool is used to perform word segmentation on the first training set and the second training set.
S32: remove the stop words from the first training set and the second training set.
In a specific implementation, the stop words in the first training set and the second training set are removed. Stop words are usually prepositions, adverbs or conjunctions; for example, words such as "inside", "also", "it" and "is" are all stop words. Because these words occur with excessive frequency, they need to be removed.
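A minimal sketch of steps S31-S32, assuming the jieba segmenter and an illustrative stop-word list (the patent names neither):

```python
# Sketch of steps S31-S32: Chinese word segmentation and stop-word removal.
import jieba

STOP_WORDS = {"的", "里面", "也", "它", "是"}  # illustrative; normally loaded from a file

def segment_and_filter(texts):
    """Turn raw texts into token lists with stop words removed."""
    return [[w for w in jieba.lcut(text) if w.strip() and w not in STOP_WORDS]
            for text in texts]
```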
Fig. 4 is a schematic block diagram of a cross-domain text sentiment classification device 40 provided by an embodiment of the invention. As shown in Fig. 4, corresponding to the above cross-domain text sentiment classification method, the invention further provides a cross-domain text sentiment classification device 40. The cross-domain text sentiment classification device 40 includes units for executing the above cross-domain text sentiment classification method, and may be configured in a terminal such as a desktop computer, tablet computer or laptop computer. Specifically, referring to Fig. 4, the cross-domain text sentiment classification device 40 includes an acquiring unit 41, a combining unit 42, a training unit 43, an establishing unit 44 and an applying unit 45.
The acquiring unit 41 is configured to merge a first training set and a second training set into a third training set, and to perform word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set.
The combining unit 42 is configured to divide the first training set into multiple sub-training sets, and to merge each sub-training set with the second training set to correspondingly obtain multiple feature training sets.
The training unit 43 is configured to train one classifier on each feature training set according to the text vector and sentiment label of each sample.
The establishing unit 44 is configured to build an integrated classifier from all trained classifiers.
The applying unit 45 is configured to apply the integrated classifier to the unlabeled training set in the target domain.
In one embodiment, as shown in Fig. 5, the combining unit 42 includes an even-division subunit 421.
The even-division subunit 421 is configured to divide the first training set evenly into multiple sub-training sets.
In one embodiment, as shown in Fig. 6, the training unit 43 includes an initialization unit 431, an iterative training unit 432 and a first determination unit 433.
The initialization unit 431 is configured to initialize the weight of each training sample in the feature training set to the initial value $w_i^1$.
The iterative training unit 432 is configured to perform multiple rounds of iterative training on the feature training set, where each iteration includes:
training a sub-classifier $G_t$ on the feature training set to obtain the classification function $G_t(x)$;
computing the classification error rate of the sub-classifier according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$.
The first determination unit 433 is configured to determine the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
In one embodiment, as shown in Fig. 7, the establishing unit 44 includes a second determination unit 441.
The second determination unit 441 is configured to determine the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
Fig. 8 is a schematic block diagram of a cross-domain text sentiment classification device provided by another embodiment of the invention. As shown in Fig. 8, the cross-domain text sentiment classification device of this embodiment adds a segmentation unit 46 and a removal unit 47 to the above embodiment.
The segmentation unit 46 is configured to perform word segmentation on the first training set and the second training set.
The removal unit 47 is configured to remove the stop words from the first training set and the second training set.
It should be noted that, as is clear to those skilled in the art, for the specific implementation process of the above cross-domain text sentiment classification device 40 and its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity of description, it is not repeated here.
The above cross-domain text sentiment classification device may be implemented in the form of a computer program that can run on a computer equipment as shown in Fig. 9.
Referring to Fig. 9, Fig. 9 is a schematic block diagram of a computer equipment provided by an embodiment of the application. The computer equipment 500 may be a terminal or a server, where the terminal may be an electronic device with a communication function such as a smartphone, tablet computer, laptop computer, desktop computer, personal digital assistant or wearable device, and the server may be an independent server or a server cluster composed of multiple servers.
Referring to Fig. 9, the computer equipment 500 includes a processor 502, a memory and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions which, when executed, can cause the processor 502 to perform a cross-domain text sentiment classification method.
The processor 502 provides computing and control capability to support the operation of the entire computer equipment 500.
The internal memory 504 provides an environment for running the computer program 5032 in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, the processor 502 can be caused to perform a cross-domain text sentiment classification method.
The network interface 505 is used for network communication with other devices. Those skilled in the art will understand that the structure shown in Fig. 9 is only a block diagram of the part of the structure relevant to the solution of the application and does not constitute a limitation on the computer equipment 500 to which the solution of the application is applied; a specific computer equipment 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the following steps:
merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
training one classifier on each feature training set according to the text vector and sentiment label of each sample;
building an integrated classifier from all trained classifiers; and
applying the integrated classifier to the unlabeled training set in the target domain.
In one embodiment, when implementing the step of dividing the first training set into multiple sub-training sets, the processor 502 specifically implements the following step:
dividing the first training set evenly into multiple sub-training sets.
In one embodiment, when implementing the step of training one classifier on each feature training set, the processor 502 specifically implements the following steps:
initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
In one embodiment, when implementing the step of building an integrated classifier from the classifiers trained on the feature training sets, the processor 502 specifically implements the following step:
determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
In one embodiment, before implementing the step of merging the first training set and the second training set into a third training set and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, the processor 502 further implements the following steps:
performing word segmentation on the first training set and the second training set; and
removing the stop words from the first training set and the second training set.
It should be understood that, in the embodiments of the application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by instructing the relevant hardware through a computer program. The computer program includes program instructions and can be stored in a storage medium, which is a computer-readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the invention further provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program, where the computer program includes program instructions. The program instructions, when executed by a processor, cause the processor to perform the following steps:
merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
training one classifier on each feature training set according to the text vector and sentiment label of each sample;
building an integrated classifier from all trained classifiers; and
applying the integrated classifier to the unlabeled training set in the target domain.
In one embodiment, when the program instructions are executed by the processor to implement the step of dividing the first training set into multiple sub-training sets, the following step is specifically implemented:
dividing the first training set evenly into multiple sub-training sets.
In one embodiment, when the program instructions are executed by the processor to implement the step of training one classifier on each feature training set, the following steps are specifically implemented:
initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
In one embodiment, when the program instructions are executed by the processor to implement the step of building an integrated classifier from the classifiers trained on the feature training sets, the following step is specifically implemented:
determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
In one embodiment, before the program instructions are executed by the processor to implement the step of merging the first training set and the second training set into a third training set and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, the following steps are also implemented:
performing word segmentation on the first training set and the second training set; and
removing the stop words from the first training set and the second training set.
The storage medium may be any of various computer-readable storage media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk or an optical disc.
Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally by function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.
In the several embodiments provided by the invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
The steps in the methods of the embodiments of the invention may be reordered, merged or deleted according to actual needs. The units in the devices of the embodiments of the invention may be combined, divided or deleted according to actual needs. In addition, the functional units in the embodiments of the invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a storage medium. Based on this understanding, the technical solution of the invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer equipment (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from the spirit and scope of the invention. Thus, if these modifications and variations of the invention fall within the scope of the claims of the invention and their equivalent technologies, the invention is also intended to include them.
The above are only specific embodiments of the invention, but the protection scope of the invention is not limited thereto. Any person familiar with the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the invention, and these modifications or replacements should be covered by the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims (10)

1. A cross-domain text sentiment classification method, characterized by comprising:
merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
training one classifier on each feature training set according to the text vector and sentiment label of each sample;
building an integrated classifier from all trained classifiers; and
applying the integrated classifier to the unlabeled training set in the target domain.
2. The cross-domain text sentiment classification method according to claim 1, characterized in that dividing the first training set into multiple sub-training sets comprises:
dividing the first training set evenly into multiple sub-training sets.
3. The cross-domain text sentiment classification method according to claim 1, characterized in that training one classifier on each feature training set comprises:
initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
4. The cross-domain text sentiment classification method according to claim 3, characterized in that building an integrated classifier from the classifiers trained on the feature training sets comprises:
determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
5. The cross-domain text sentiment classification method according to claim 1, characterized in that before merging the first training set and the second training set into a third training set and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, the method further comprises:
performing word segmentation on the first training set and the second training set; and
removing the stop words from the first training set and the second training set.
6. A cross-domain text sentiment classification device, characterized by comprising:
an acquiring unit for merging a first training set and a second training set into a third training set, and performing word-vector training on the third training set with a word-vector tool to obtain the text vector and sentiment label of each sample in the third training set, wherein the first training set is a labeled training set in a source domain, the second training set is a labeled training set in a target domain, and the target domain further contains an unlabeled training set;
a combining unit for dividing the first training set into multiple sub-training sets, and merging each sub-training set with the second training set to correspondingly obtain multiple feature training sets;
a training unit for training one classifier on each feature training set according to the text vector and sentiment label of each sample;
an establishing unit for building an integrated classifier from all trained classifiers; and
an applying unit for applying the integrated classifier to the unlabeled training set in the target domain.
7. The cross-domain text sentiment classification device according to claim 6, characterized in that the training unit comprises:
an initialization unit for initializing the weight of each training sample in the feature training set to the initial value $w_i^1$;
an iterative training unit for performing multiple rounds of iterative training on the feature training set, each iteration comprising:
training a sub-classifier $G_t$ on the feature training set to obtain a classification function $G_t(x)$;
computing the classification error rate of the sub-classifier $G_t$ according to formula (1): $e_t = \sum_{i=1}^{M} w_i^t \, I\big(G_t(x_i) \neq y_i\big)$;
computing the weight of the sub-classifier $G_t$ according to formula (2): $\alpha_t = \frac{1}{2}\ln\frac{1-e_t}{e_t}$;
obtaining the weight of each training sample for the next iteration according to formula (3), $w_i^{t+1} = \frac{w_i^t}{Z_t}\exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$, and formula (4), $Z_t = \sum_{i=1}^{M} w_i^t \exp\big(-\alpha_t\, y_i\, G_t(x_i)\big)$; and
a first determination unit for determining the classifier $L_j(x)$ from the above results according to formula (5): $L_j(x) = \operatorname{sign}\big(\sum_{t=1}^{T} \alpha_t\, G_t(x)\big)$;
wherein x is the text vector of a training sample in the feature training set, y is the sentiment label of a training sample in the feature training set, i is the index of a training sample in the feature training set, M is the number of training samples in the feature training set, t is the number of the current iteration, T is the total number of iterations, j is the index of the feature training set, and $w_i^t$ is the weight of a training sample of the feature training set in the current iteration.
8. The cross-domain text sentiment classification device according to claim 7, characterized in that the establishing unit comprises:
a second determination unit for determining the integrated classifier L(x) according to the formula $L(x) = \operatorname{sign}\big(\frac{1}{m}\sum_{j=1}^{m} L_j(x)\big)$, where m is the number of feature training sets.
9. A computer equipment, characterized in that the computer equipment comprises a memory and a processor, the memory storing a computer program, and the processor implementing the method according to any one of claims 1 to 5 when executing the computer program.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 5.
CN201810770172.5A 2018-07-13 2018-07-13 Cross-domain text sentiment classification method, device, computer equipment and storage medium Withdrawn CN108959265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810770172.5A CN108959265A (en) 2018-07-13 2018-07-13 Cross-domain text sentiment classification method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810770172.5A CN108959265A (en) 2018-07-13 2018-07-13 Cross-domain text sentiment classification method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN108959265A (en) 2018-12-07

Family

ID=64483990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810770172.5A Withdrawn CN108959265A (en) 2018-07-13 2018-07-13 Cross-domain texts sensibility classification method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108959265A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857861A (en) * 2019-01-04 2019-06-07 平安科技(深圳)有限公司 File classification method, device, server and medium based on convolutional neural networks
WO2020143303A1 (en) * 2019-01-10 2020-07-16 平安科技(深圳)有限公司 Method and device for training deep learning model, computer apparatus, and storage medium
CN110009038A (en) * 2019-04-04 2019-07-12 北京百度网讯科技有限公司 Training method, device and the storage medium of screening model
CN110378389A (en) * 2019-06-24 2019-10-25 苏州浪潮智能科技有限公司 A kind of Adaboost classifier calculated machine creating device
CN111078876A (en) * 2019-12-04 2020-04-28 国家计算机网络与信息安全管理中心 Short text classification method and system based on multi-model integration
US11423333B2 (en) 2020-03-25 2022-08-23 International Business Machines Corporation Mechanisms for continuous improvement of automated machine learning
CN111831826A (en) * 2020-07-24 2020-10-27 腾讯科技(深圳)有限公司 Training method, classification method and device of cross-domain text classification model
CN111831826B (en) * 2020-07-24 2022-10-18 腾讯科技(深圳)有限公司 Training method, classification method and device of cross-domain text classification model

Similar Documents

Publication Publication Date Title
CN108959265A (en) Cross-domain text sentiment classification method, device, computer equipment and storage medium
CN106611052B (en) The determination method and device of text label
CN106484139B (en) Emoticon recommended method and device
CN108399228A (en) Article sorting technique, device, computer equipment and storage medium
CN107038480A (en) A kind of text sentiment classification method based on convolutional neural networks
CN106445919A (en) Sentiment classifying method and device
CN109376240A (en) A kind of text analyzing method and terminal
CN109299264A (en) File classification method, device, computer equipment and storage medium
CN110532563A (en) The detection method and device of crucial paragraph in text
CN105095179B (en) The method and device that user's evaluation is handled
CN108733675B (en) Emotion evaluation method and device based on large amount of sample data
CN111046886A (en) Automatic identification method, device and equipment for number plate and computer readable storage medium
CN105956083A (en) Application software classification system, application software classification method and server
Otte et al. Local feature based online mode detection with recurrent neural networks
CN105117740A (en) Font identification method and device
CN106778878A (en) A kind of character relation sorting technique and device
CN106537387B (en) Retrieval/storage image associated with event
Nihal et al. Bangla sign alphabet recognition with zero-shot and transfer learning
CN109471932A (en) Rumour detection method, system and storage medium based on learning model
CN102708164A (en) Method and system for calculating movie expectation
CN109446300A (en) A kind of corpus preprocess method, the pre- mask method of corpus and electronic equipment
CN108009248A (en) A kind of data classification method and system
CN109359198A (en) A kind of file classification method and device
CN107330009A (en) Descriptor disaggregated model creation method, creating device and storage medium
CN109597987A (en) A kind of text restoring method, device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20181207)