CN108733778A - Method and device for recognizing the industry type of an object - Google Patents

Method and device for recognizing the industry type of an object

Info

Publication number
CN108733778A
Authority
CN
China
Prior art keywords
identified
industry type
vector space
training sample
industry
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810420223.1A
Other languages
Chinese (zh)
Other versions
CN108733778B (en)
Inventor
赵辉
崔燕
岳爱珍
谭静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810420223.1A
Publication of CN108733778A
Application granted
Publication of CN108733778B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention proposes a method and a device for recognizing the industry type of an object. The method includes: inputting the text information of each object to be identified into a language model for generating paragraph vectors for learning, to obtain a vector space of the object to be identified that is related to industry type; choosing, according to the vector space of each object to be identified, first objects to be identified from all objects to be identified as training sample objects, and obtaining labeled data of the training sample objects; training a constructed industry type identification model using the vector spaces and labeled data of the training sample objects, to obtain a target industry type identification model; and, for each second object to be identified other than the training sample objects, inputting the vector space of the second object to be identified into the target industry type identification model for learning, to obtain the industry type to which the second object to be identified belongs. The method can improve the accuracy of the recognition results of the industry type identification model.

Description

Method and device for recognizing the industry type of an object
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for recognizing the industry type of an object.
Background
With the continuous development and spread of Internet technology, users and enterprises generate more and more information data across every dimension of terminal devices, and continuing technical progress has made it possible to compute over this data. Analysis and profiling of users and enterprises, and the marketing and recommendation technologies built on them, are becoming increasingly personalized and fine-grained. In these personalized, fine-grained application scenarios, industry information is a vital link: a user's industry interests and similar attributes form the basis of such personalized data, and classifying enterprises by industry type can help major network platforms discover potential customers.
In the prior art, keywords are extracted from text information such as network logs and browsed news articles, and an industry word set is mined from text information labeled with industry tags. For the text information of an enterprise to be classified, whether it contains industry words from the industry word set is determined and used as the industry feature of the text information, and an industry type identification model is built on algorithms such as naive Bayes, logistic regression, and gradient boosting decision tree (GBDT). The industry word set is generated by using a Bayesian method to screen words with a high posterior probability of being related to an industry out of the enterprise text information labeled with industry tags, followed by manual screening.
In this approach, the recognition results of the industry type identification model are limited by how well the industry word set, mined and compiled from the labeled text information, covers all industry words. If the text information of an enterprise to be classified contains few industry words from the industry word set, or contains industry words that are absent from it, the accuracy of the recognition results of the industry type identification model is low.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, a first object of the present invention is to propose a method for recognizing the industry type of an object, so as to improve the accuracy of the recognition results of the industry type identification model and to avoid the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
A second object of the present invention is to propose a device for recognizing the industry type of an object.
A third object of the present invention is to propose a computer device.
A fourth object of the present invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the present invention is to propose a computer program product.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a method for recognizing the industry type of an object, including:
inputting the text information of an object to be identified into a language model for generating paragraph vectors for learning, to obtain a vector space of the object to be identified that is related to industry type;
choosing, according to the vector space of each object to be identified, a first object to be identified from all objects to be identified as a training sample object;
obtaining labeled data of the training sample object, where the labeled data indicates the industry type to which the training sample object belongs;
training a constructed industry type identification model using the vector space of the training sample object and the labeled data, to obtain a target industry type identification model; and
for each second object to be identified other than the training sample object, inputting the vector space of the second object to be identified into the target industry type identification model for learning, to obtain the industry type to which the second object to be identified belongs.
In the method for recognizing the industry type of an object according to the embodiment of the present invention, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still recognize the industry type to which a second object to be identified belongs. This avoids the prior-art situation in which the industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot recognize a second object to be identified whose text information contains industry words absent from that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
To achieve the above objects, an embodiment of the second aspect of the present invention proposes a device for recognizing the industry type of an object, including:
a first input module, configured to input the text information of an object to be identified into a language model for generating paragraph vectors for learning, to obtain a vector space of the object to be identified that is related to industry type;
a choosing module, configured to choose, according to the vector space of each object to be identified, a first object to be identified from all objects to be identified as a training sample object;
an acquisition module, configured to obtain labeled data of the training sample object, where the labeled data indicates the industry type to which the training sample object belongs;
a training module, configured to train a constructed industry type identification model using the vector space of the training sample object and the labeled data, to obtain a target industry type identification model; and
a second input module, configured to, for each second object to be identified other than the training sample object, input the vector space of the second object to be identified into the target industry type identification model for learning, to obtain the industry type to which the second object to be identified belongs.
In the device for recognizing the industry type of an object according to the embodiment of the present invention, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still recognize the industry type to which a second object to be identified belongs. This avoids the prior-art situation in which the industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot recognize a second object to be identified whose text information contains industry words absent from that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
To achieve the above objects, an embodiment of the third aspect of the present invention proposes a computer device, including a processor and a memory,
where the processor, by reading executable program code stored in the memory, runs a program corresponding to the executable program code, so as to implement the method for recognizing the industry type of an object described in the embodiment of the first aspect of the present invention.
To achieve the above objects, an embodiment of the fourth aspect of the present invention proposes a non-transitory computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method for recognizing the industry type of an object described in the embodiment of the first aspect of the present invention.
To achieve the above objects, an embodiment of the fifth aspect of the present invention proposes a computer program product, where instructions in the computer program product, when executed by a processor, implement the method for recognizing the industry type of an object described in the embodiment of the first aspect of the present invention.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow diagram of the method for recognizing the industry type of an object provided by Embodiment 1 of the present invention;
Fig. 2 is a flow diagram of the method for recognizing the industry type of an object provided by Embodiment 2 of the present invention;
Fig. 3 is a flow diagram of the method for recognizing the industry type of an object provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic diagram of the application scenario provided by Embodiment 4 of the present invention;
Fig. 5 is a structural schematic diagram of a device for recognizing the industry type of an object provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of another device for recognizing the industry type of an object provided by an embodiment of the present invention;
Fig. 7 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.
Detailed description
Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, where the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplary; they are intended to explain the present invention and should not be construed as limiting it.
The method and device for recognizing the industry type of an object according to embodiments of the present invention are described below with reference to the accompanying drawings. Before the embodiments are described in detail, a common technical term is introduced first for ease of understanding:
Industry refers to the detailed division of operating units or individual organizations in the national economy that engage in production of the same nature or in other economic and social activities, such as forestry, the automotive industry, and banking.
Fig. 1 is a flow diagram of the method for recognizing the industry type of an object provided by Embodiment 1 of the present invention.
As shown in Fig. 1, the method for recognizing the industry type of an object includes the following steps:
Step 101: input the text information of an object to be identified into a language model for generating paragraph vectors for learning, and obtain a vector space of the object to be identified that is related to industry type.
In the embodiment of the present invention, an object to be identified is an object whose industry type needs to be recognized. The text information of an object to be identified may include its name and the business scope description field that it registered with the administration for industry and commerce. For example, when the object to be identified is Baidu, whose business scope is network information services, the text information of Baidu may include: Baidu, and network information services.
In the embodiment of the present invention, the language model is trained in advance and is used to generate paragraph vectors from text information; for example, the language model may be an unsupervised document vector (Doc2vec) model. Optionally, an unsupervised language model may be used to learn paragraph-vector semantics from the text information of all objects to be identified. In the learned vector space, words and information of the same industry cluster together, so the vector space obtained from the language model for generating paragraph vectors can contain more, and more comprehensive, industry information.
In the embodiment of the present invention, the language model is trained on the text information of all objects to be identified. By choosing text information that is strongly correlated with industry, such as the name of the object to be identified and the business scope description field that it registered with the administration for industry and commerce, to train the unsupervised language model, the resulting vector space tends to be dominated by industry information.
Step 102: according to the vector space of each object to be identified, choose a first object to be identified from all objects to be identified as a training sample object.
Optionally, first objects to be identified may be chosen at random from all objects to be identified as training sample objects. Alternatively, first objects to be identified may be chosen in a preset order: for example, all objects to be identified may be sorted, and a preset number of first objects to be identified may then be chosen from front to back or from back to front as training sample objects. Alternatively, first objects to be identified may be chosen from all objects to be identified according to a preset algorithm. No limitation is imposed here.
The preset algorithm is set in advance. For example, it may be a clustering algorithm: all objects to be identified are clustered by the clustering algorithm to obtain the clusters, and first objects to be identified are then drawn at random from each cluster as training sample objects.
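The cluster-based choice of training sample objects can be illustrated with a minimal Python sketch. It assumes the clustering has already been done and is given as a mapping from object ID to cluster label; the function name `sample_training_objects`, the `per_cluster` parameter, and the sample IDs are hypothetical and are not taken from the patent.

```python
import random
from collections import defaultdict


def sample_training_objects(cluster_of, per_cluster=1, seed=0):
    """Randomly draw up to `per_cluster` object IDs from every cluster.

    cluster_of: dict mapping object ID -> cluster label (assumed input,
    for example the result of clustering the vector spaces).
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    members = defaultdict(list)
    for object_id, label in cluster_of.items():
        members[label].append(object_id)

    chosen = []
    for label in sorted(members):
        pool = sorted(members[label])
        k = min(per_cluster, len(pool))  # small clusters contribute all they have
        chosen.extend(rng.sample(pool, k))
    return chosen


# Hypothetical clustering result: five objects in three clusters.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2}
training_ids = sample_training_objects(cluster_of, per_cluster=1)
```

Drawing from every cluster, rather than from the pool as a whole, is what keeps small clusters represented among the training sample objects.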
Step 103: obtain the labeled data of the training sample object, where the labeled data indicates the industry type to which the training sample object belongs.
As one possible implementation, the industry type to which a training sample object belongs may be labeled manually, for example according to the business scope description field in the text information of the training sample object. After manual labeling, the labeled data of the training sample object is obtained.
Step 104: train the constructed industry type identification model using the vector space and labeled data of the training sample object, and obtain the target industry type identification model.
In the embodiment of the present invention, the industry type identification model may be built on a logistic regression algorithm, or on a convolutional neural network (CNN) with fully connected layers, or on other algorithms. No limitation is imposed here.
In the embodiment of the present invention, the labeled data of a training sample object indicates the industry type to which it belongs, and the vector space of a training sample object contains its industry information. Therefore, after the constructed industry type identification model is trained with the vector spaces and labeled data of the training sample objects, the resulting target industry type identification model can recognize the vector space of an object to be identified and determine the industry type to which that object belongs.
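As an illustration of step 104, the following Python sketch trains a multinomial logistic regression (softmax regression) on labeled vectors using plain batch gradient descent. It is one possible realization of the logistic regression option mentioned above, not the patent's implementation; the function names, the learning rate, the epoch count, and the two-dimensional toy vectors are all assumptions made for the example.

```python
import math


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(weights, biases, x):
    """Return one recognition probability per industry type for vector x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train_softmax_regression(vectors, labels, num_classes, lr=0.5, epochs=200):
    """Fit multinomial logistic regression by batch gradient descent."""
    dim = len(vectors[0])
    n = len(vectors)
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        grad_w = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(vectors, labels):
            probs = predict_proba(weights, biases, x)
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient
                grad_b[c] += err
                for j in range(dim):
                    grad_w[c][j] += err * x[j]
        for c in range(num_classes):
            biases[c] -= lr * grad_b[c] / n
            for j in range(dim):
                weights[c][j] -= lr * grad_w[c][j] / n
    return weights, biases


# Toy stand-ins for the vector spaces and labeled data of training sample objects.
train_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]  # index of the labeled industry type
weights, biases = train_softmax_regression(train_vectors, train_labels, num_classes=2)
probs = predict_proba(weights, biases, [0.8, 0.2])
```

In practice an existing library classifier would normally be used instead; the hand-written loop is only meant to show what "training with vector spaces and labeled data" amounts to.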
Step 105: for each second object to be identified other than the training sample objects, input the vector space of the second object to be identified into the target industry type identification model for learning, and obtain the industry type to which the second object to be identified belongs.
In the embodiment of the present invention, the second objects to be identified are the objects to be identified other than the training sample objects.
It can be understood that, for each second object to be identified other than the training sample objects, inputting its vector space into the target industry type identification model yields a recognition probability for each industry type. Therefore, in the embodiment of the present invention, the industry type with the largest recognition probability may be selected as the industry type to which the second object to be identified belongs.
For example, suppose the industry types are A, B, C, D, E, F, and G. Inputting the vector space of a second object to be identified into the target industry type identification model yields the recognition probabilities of A, B, C, D, E, F, and G, denoted P1, P2, P3, P4, P5, P6, and P7 respectively. If P3 is the largest, C is taken as the industry type to which the second object to be identified belongs.
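Selecting the industry type with the largest recognition probability is a simple argmax, sketched below in Python. The function name `most_probable_industry` is hypothetical, and the probability values are invented for illustration only; they stand in for P1 to P7.

```python
def most_probable_industry(probabilities):
    """Return the industry type whose recognition probability is largest.

    probabilities: dict mapping industry type -> recognition probability.
    """
    return max(probabilities, key=probabilities.get)


# Illustrative values only; they stand in for P1..P7 in the text.
probabilities = {"A": 0.05, "B": 0.10, "C": 0.40, "D": 0.15,
                 "E": 0.10, "F": 0.12, "G": 0.08}
predicted = most_probable_industry(probabilities)
```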
In the embodiment of the present invention, even if an industry descriptor in the text information of some second object to be identified does not appear in the industry word set of the training sample objects, the language model in step 101 has learned paragraph-vector semantics from the text information of all objects to be identified, and words and information of the same industry cluster together in the learned vector space. A target industry type identification model trained on a limited number of training samples can therefore still recognize the industry type to which the second object to be identified belongs, which safeguards the accuracy of the recognition results of the industry type identification model.
In the method for recognizing the industry type of an object according to this embodiment, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still recognize the industry type to which a second object to be identified belongs. This avoids the prior-art situation in which the industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot recognize a second object to be identified whose text information contains industry words absent from that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
To explain the previous embodiment more clearly, this embodiment provides another method for recognizing the industry type of an object. Fig. 2 is a flow diagram of the method for recognizing the industry type of an object provided by Embodiment 2 of the present invention.
As shown in Fig. 2, the method for recognizing the industry type of an object may include the following steps:
Step 201: input the text information of the object to be identified separately into language models built on different algorithms, and obtain the first vector space output by each language model.
In the embodiment of the present invention, language models may be built in advance on different algorithms. For example, language models may be built on the distributed bag of words (DBOW) algorithm and the distributed memory (DM) algorithm, yielding a DBOW model and a DM model.
Optionally, after the language models have been built on the different algorithms, the text information of the object to be identified may be input separately into each language model for learning, to obtain the first vector space output by each language model.
Step 202: combine the first vector spaces output by the different language models into the vector space of the object to be identified.
In the embodiment of the present invention, so that the first vector spaces output by the different language models can be combined easily, the first vector space output by each language model has the same dimension. For example, denote the dimension of the first vector space output by each language model as n.
Optionally, denote the number of language models as m. After the first vector spaces output by the different language models are combined into the vector space of the object to be identified, the vector space has m×n dimensions. In subsequent steps, the m×n-dimensional vector space can be used to train the industry type identification model, which can improve the accuracy with which the target industry type identification model recognizes the industry type to which an object to be identified belongs.
For example, when language models are built on the DBOW and DM algorithms, a DBOW model and a DM model are obtained. The text information of the object to be identified is input separately into the DBOW model and the DM model for learning, giving the first vector space output by the DBOW model and the first vector space output by the DM model. Combining the two gives a 2n-dimensional vector space of the object to be identified, which can be used in subsequent steps to train the industry type identification model and thereby improve the accuracy with which the target industry type identification model recognizes the industry type to which an object to be identified belongs.
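The combination in step 202 can be read as concatenating the per-model vectors, which is what produces m×n dimensions. The Python sketch below assumes that reading; the function name `combine_vectors` and the three-dimensional sample vectors are hypothetical, and the DBOW and DM outputs are hard-coded rather than produced by real models.

```python
def combine_vectors(first_vectors):
    """Concatenate the per-model vectors of one object into a single vector."""
    dims = {len(v) for v in first_vectors}
    if len(dims) != 1:
        raise ValueError("every language model must output the same dimension n")
    combined = []
    for v in first_vectors:
        combined.extend(v)
    return combined


dbow_vector = [0.1, 0.2, 0.3]  # hypothetical output of the DBOW model, n = 3
dm_vector = [0.4, 0.5, 0.6]    # hypothetical output of the DM model, n = 3
object_vector = combine_vectors([dbow_vector, dm_vector])  # m * n = 6 dimensions
```

The dimension check mirrors the requirement above that every language model outputs a first vector space of the same dimension n.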
Step 203: according to the vector spaces of the objects to be identified, calculate the similarity between objects to be identified, and cluster all objects to be identified according to the similarity.
It can be understood that the semantic similarity between objects to be identified can be judged from a measure of closeness between them, such as cosine similarity or Euclidean distance, and that objects to be identified with high semantic similarity can then be clustered to obtain objects to be identified of the same industry type. Therefore, in the embodiment of the present invention, the similarity between objects to be identified may be calculated from their vector spaces, and objects to be identified whose similarity exceeds a preset threshold may then be clustered. This yields semantically similar clusters, that is, objects to be identified of the same industry type are grouped into one cluster.
Optionally, one object may be chosen at random from all objects to be identified as the base object. An object is then selected from the other objects (the objects to be identified other than the base object), and the similarity between the two is calculated from the vector space of that object and the vector space of the base object. If the similarity is higher than the preset threshold, the object is clustered with the base object; otherwise the object is skipped. Another object is then selected from the other objects and its similarity to the base object is calculated, and so on until every object to be identified has been processed against the base object, which yields the objects to be identified of the same industry type as the base object.
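The base-object procedure above can be sketched in Python as follows, using cosine similarity as the similarity measure. This is an illustrative sketch under assumptions: the base object is passed in explicitly rather than chosen at random, and the function names, the threshold of 0.9, and the sample IDs and vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def cluster_around_base(vectors, base_id, threshold):
    """Collect the IDs whose similarity to the base object exceeds threshold."""
    base = vectors[base_id]
    cluster = [base_id]
    for object_id, vector in vectors.items():
        if object_id == base_id:
            continue
        if cosine_similarity(base, vector) > threshold:
            cluster.append(object_id)  # same cluster as the base object
        # otherwise the object is skipped for this base object
    return cluster


# Hypothetical vector spaces keyed by object ID.
vectors = {
    "bank_1": [0.9, 0.1],
    "bank_2": [0.8, 0.2],
    "farm_1": [0.1, 0.9],
}
same_industry = cluster_around_base(vectors, "bank_1", threshold=0.9)
```

Repeating the procedure with a new base object taken from the objects that were skipped would produce the remaining clusters.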
Step 204: choose first objects to be identified at random from each cluster as training sample objects.
In the embodiment of the present invention, first objects to be identified are chosen at random from each cluster as training sample objects, so that the training sample objects cover every industry type, which ensures that the extracted training sample objects are representative.
Step 205: obtain the labeled data of the training sample object, where the labeled data indicates the industry type to which the training sample object belongs.
Step 206: train the constructed industry type identification model using the vector space and labeled data of the training sample object, and obtain the target industry type identification model.
Step 207: for each second object to be identified other than the training sample objects, input the vector space of the second object to be identified into the target industry type identification model for learning, and obtain the industry type to which the second object to be identified belongs.
For the execution of steps 205 to 207, refer to the execution of steps 103 to 105 in the above embodiment; details are not repeated here.
In the method for recognizing the industry type of an object according to this embodiment, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still recognize the industry type to which a second object to be identified belongs. This avoids the prior-art situation in which the industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot recognize a second object to be identified whose text information contains industry words absent from that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
For an embodiment in clear explanation, the industry type recognition methods of another object, Fig. 3 are present embodiments provided The flow diagram of the industry type recognition methods of the object provided by the embodiment of the present invention three.
As shown in figure 3, the industry type recognition methods of the object may comprise steps of:
Step 301, the text message of object to be identified is separately input in the language model built based on algorithms of different, Obtain the primary vector space of each language model output.
Step 302, primary vector space different language model exported, is combined into the vector space of object to be identified.
The implementation procedure of step 301~302 may refer to the implementation procedure of step 201~202 in above-described embodiment, herein It does not repeat.
Step 303, the mapping relations between the vector space of object to be identified and the identification information of object to be identified are established.
In the embodiment of the present invention, the identification information of object to be identified is to be identified right for the unique mark object to be identified The identification information of elephant for example can be the title of object to be identified, alternatively, the identification information of object to be identified can be to be identified The ID of object is not restricted this.
In the embodiment of the present invention, reflecting between the vector space of object to be identified and the identification information of object to be identified is established Relationship is penetrated, to which after determining the object to be identified for needing to carry out industry type identification, the mark of the object to be identified can be passed through Know information, inquire above-mentioned mapping relations, the vector space corresponding with the identification information of the object to be identified is obtained, without profit The text message that the object to be identified is re-recognized with speech model, obtains object to be identified and the relevant vector of industry type is empty Between, it is easy to operate and be easily achieved.
Step 304, according to the mapping relation, the vector space of the object to be identified is stored in a dictionary.
Step 305, according to the vector space of each object to be identified, first objects to be identified are chosen from all objects to be identified as training sample objects.
Step 306, the labeled data of the training sample objects is obtained, wherein the labeled data indicates the industry type to which each training sample object belongs.
Step 307, using the vector space and labeled data of the training sample objects, the constructed industry type identification model is trained to obtain the target industry type identification model.
For the implementation of steps 305~307, refer to the implementation of steps 102~104 in the above embodiment; it is not repeated here.
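The training in step 307 can be illustrated with a deliberately small stand-in. The sketch below swaps in a plain softmax regression trained by stochastic gradient descent; it is not the model used in the application scenario of this document, and the toy vectors, labels, learning rate, and epoch count are all assumptions chosen only to show how vector spaces and labeled data feed a classifier.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights: List[Vector], biases: Vector, vec: Vector) -> List[float]:
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, biases)]


def train_classifier(
    vectors: Sequence[Vector],
    labels: Sequence[int],
    num_classes: int,
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[Vector], Vector]:
    """Softmax regression trained with per-sample gradient descent."""
    dim = len(vectors[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for vec, label in zip(vectors, labels):
            probs = softmax(class_scores(weights, biases, vec))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss with respect to score c.
                grad = probs[c] - (1.0 if c == label else 0.0)
                for j in range(dim):
                    weights[c][j] -= lr * grad * vec[j]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights: List[Vector], biases: Vector, vec: Vector) -> int:
    scores = class_scores(weights, biases, vec)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy training sample objects: vector spaces plus manually assigned industry labels.
train_vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.1, 0.8]]
train_labels = [0, 0, 1, 1]
weights, biases = train_classifier(train_vectors, train_labels, num_classes=2)
predicted = predict(weights, biases, [0.8, 0.1])
print(predicted)
```

The trained weights play the role of the target industry type identification model; a vector that was not part of the training set is then classified by the same scoring function.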
Step 308, the identification information of the second object to be identified is obtained.
Optionally, after each second object to be identified other than the training sample objects has been determined, that is, after the second object to be identified that needs industry type recognition has been determined, its identification information, such as its ID, can be obtained. Specifically, a page for entering the identification information of the second object to be identified by voice or text can be provided, and the identification information entered by the user by voice or text can then be obtained through that page.
Step 309, according to the identification information of the second object to be identified, the mapping relation is queried and the vector space of the second object to be identified is obtained from the dictionary.
In the embodiment of the present invention, after the second object to be identified that needs industry type recognition has been determined, the mapping relation can be queried with its identification information to obtain the corresponding vector space from the dictionary. There is no need to process the text information of the second object to be identified again with the language model to obtain its vector space related to industry type, so the operation is simple and easy to implement.
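The storing and querying described in steps 303, 304, and 309 amount to a keyed lookup. The following is a minimal sketch under the assumption that the dictionary is an in-memory mapping; the class name `VectorDictionary`, its method names, and the sample ID `obj-001` are hypothetical and do not come from the original text.

```python
from typing import Dict, List

Vector = List[float]


class VectorDictionary:
    """Maps an object's identification information (name or ID) to its vector space."""

    def __init__(self) -> None:
        self._vectors: Dict[str, Vector] = {}

    def store(self, object_id: str, vector: Vector) -> None:
        # Steps 303~304: establish the mapping and store the vector space.
        self._vectors[object_id] = list(vector)

    def lookup(self, object_id: str) -> Vector:
        # Step 309: query the mapping instead of re-running the language model.
        try:
            return self._vectors[object_id]
        except KeyError:
            raise KeyError(f"no vector space stored for object {object_id!r}") from None


dictionary = VectorDictionary()
dictionary.store("obj-001", [0.12, -0.40, 0.88])
vector = dictionary.lookup("obj-001")
print(vector)
```

Because the vector space is computed once and then read back by ID, the cost of the language model is paid only when an object is first processed.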
Step 310, for each second object to be identified other than the training sample objects, the vector space of the second object to be identified is input into the target industry type identification model for learning, and the industry type to which the second object to be identified belongs is obtained.
For the implementation of step 310, refer to the implementation of step 105 in the above embodiment; it is not repeated here.
In the industry type recognition method of the object of this embodiment, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still identify the industry type to which the second object to be identified belongs. This avoids the situation in the prior art where an industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot identify a second object to be identified whose text information contains industry words outside that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
As an example, see Fig. 4, which is a schematic diagram of an application scenario provided by Embodiment 4 of the present invention. Here, argmax is the function that finds the argument with the maximum score, and softmax is the soft-maximum transfer function, which converts scores into a probability distribution.
As shown in Fig. 4, unsupervised Doc2vec learning can be performed on all objects to be identified (100 million or more) to obtain, for each object to be identified, the vector space related to industry type. First objects to be identified are then chosen from all objects to be identified as training sample objects. It should be noted that, so that the chosen training sample objects are representative, the number of first objects to be identified that are chosen can reach 10 million or more. The industry type to which each training sample object belongs is then labeled manually, and, based on a CNN + fully connected layer algorithm, the constructed industry type identification model is trained with the vector space and labeled data of the training sample objects to obtain the target industry type identification model. Finally, the vector space of the second object to be identified that needs industry type recognition is input into the target industry type identification model for learning, which yields the identification probability of each industry type; the industry type with the maximum identification probability is then selected as the industry type to which the second object to be identified belongs.
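The final selection in this scenario, softmax followed by argmax, can be sketched as follows. The raw scores and the industry labels are hypothetical values standing in for the output of the trained model.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick_industry(scores: Sequence[float], industries: Sequence[str]) -> Tuple[str, float]:
    """Return the industry type with the highest identification probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return industries[best], probs[best]


industries = ["catering", "education", "logistics"]  # hypothetical industry types
label, prob = pick_industry([1.0, 3.0, 0.5], industries)
print(label, round(prob, 3))
```

Softmax does not change which score is largest; it only rescales the scores so that they can be read as identification probabilities.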
In order to implement the above embodiments, the present invention also proposes an industry type identification device for an object.
Fig. 5 is a structural schematic diagram of an industry type identification device for an object provided by an embodiment of the present invention.
As shown in Fig. 5, the industry type identification device 100 of the object includes: a first input module 110, a selection module 120, a first acquisition module 130, a training module 140, and a second input module 150, wherein:
The first input module 110 is configured to input the text information of the object to be identified into a language model for generating paragraph vectors for learning, and to obtain the vector space of the object to be identified that is related to industry type.
As a possible implementation, the first input module 110 is specifically configured to separately input the text information into language models built on different algorithms and obtain the first vector space output by each language model, and to combine the first vector spaces output by the different language models into the vector space of the object to be identified.
Here, the first vector spaces output by the language models have the same dimension.
The selection module 120 is configured to choose, according to the vector space of each object to be identified, first objects to be identified from all objects to be identified as training sample objects.
As a possible implementation, the selection module 120 is specifically configured to calculate the similarity between objects to be identified according to their vector spaces, to cluster all objects to be identified according to the similarity, and to randomly select first objects to be identified from each cluster as training sample objects.
The first acquisition module 130 is configured to obtain the labeled data of the training sample objects, wherein the labeled data indicates the industry type to which each training sample object belongs.
The training module 140 is configured to train the constructed industry type identification model using the vector space and labeled data of the training sample objects, and to obtain the target industry type identification model.
The second input module 150 is configured to, for each second object to be identified other than the training sample objects, input the vector space of the second object to be identified into the target industry type identification model for learning, and to obtain the industry type to which the second object to be identified belongs.
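The clustering-based choice of training sample objects performed by the selection module can be sketched as follows. The text does not name a similarity measure or a clustering algorithm, so this sketch assumes cosine similarity and a simple greedy threshold clustering; the threshold, the object IDs, and the toy vectors are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_by_similarity(vectors: Dict[str, Vector], threshold: float) -> List[List[str]]:
    """Greedy clustering: join the first cluster whose seed object is similar enough."""
    clusters: List[List[str]] = []
    for object_id, vec in vectors.items():
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(object_id)
                break
        else:
            clusters.append([object_id])  # no similar cluster found: start a new one
    return clusters


def sample_training_objects(
    clusters: List[List[str]], per_cluster: int, rng: random.Random
) -> List[str]:
    """Randomly pick up to per_cluster objects from every cluster."""
    chosen: List[str] = []
    for cluster in clusters:
        chosen.extend(rng.sample(cluster, min(per_cluster, len(cluster))))
    return chosen


vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}
clusters = cluster_by_similarity(vectors, threshold=0.9)
training_ids = sample_training_objects(clusters, per_cluster=1, rng=random.Random(0))
print(clusters, training_ids)
```

Sampling from every cluster, rather than from the whole population at once, keeps each group of similar objects represented among the training sample objects.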
Further, in a possible implementation of the embodiment of the present invention, referring to Fig. 6, on the basis of the embodiment shown in Fig. 5, the industry type identification device 100 of the object may further include:
An establishing module 160, configured to establish the mapping relation between the vector space of the object to be identified and the identification information of the object to be identified.
A storage module 170, configured to store the vector space of the object to be identified in a dictionary according to the mapping relation.
A second acquisition module 180, configured to obtain the identification information of the second object to be identified.
A query module 190, configured to query the mapping relation according to the identification information of the second object to be identified and to obtain the vector space of the second object to be identified from the dictionary.
It should be noted that the foregoing explanation of the embodiments of the industry type recognition method of the object also applies to the industry type identification device 100 of the object of this embodiment, and is not repeated here.
In the industry type identification device of the object of this embodiment, the language model learns paragraph-vector semantics from the text information of all objects to be identified, and in the learned vector space, words and information of the same industry cluster together. A target industry type identification model trained on a limited number of training samples can therefore still identify the industry type to which the second object to be identified belongs. This avoids the situation in the prior art where an industry type identification model can only classify text information containing industry words from an existing industry word set, and cannot identify a second object to be identified whose text information contains industry words outside that set. In other words, it avoids the prior-art situation in which the recall rate of the industry type identification model is limited by the scale of the training data.
In order to implement the above embodiments, the present invention also proposes a computer device, including a processor and a memory;
wherein the processor reads the executable program code stored in the memory and runs the program corresponding to the executable program code, so as to implement the industry type recognition method of the object as proposed by the embodiments of the present invention.
In order to implement the above embodiments, the present invention also proposes a non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the industry type recognition method of the object as proposed by the embodiments of the present invention.
In order to implement the above embodiments, the present invention also proposes a computer program product; when instructions in the computer program product are executed by a processor, the industry type recognition method of the object as proposed by the embodiments of the present invention is implemented.
Fig. 7 shows a block diagram of an exemplary computer device suitable for implementing the embodiments of the present application. The computer device 12 shown in Fig. 7 is only an example and should not impose any restriction on the function and scope of use of the embodiments of the present application.
As shown in Fig. 7, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 can be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in Fig. 7, commonly referred to as a "hard disk drive"). Although not shown in Fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") can be provided, as can an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a compact disc read-only memory (CD-ROM), a digital video disc read-only memory (DVD-ROM), or other optical media). In these cases, each drive can be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 can be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
The computer device 12 can also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 22. The computer device 12 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules can be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 runs the programs stored in the system memory 28 to execute various functional applications and data processing, for example to implement the industry type recognition method of the object described in the foregoing embodiments.
In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may thus explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
Any process or method description in a flow chart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a custom logic function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flow chart or otherwise described herein, for example an ordered list of executable instructions that may be considered to implement logic functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include the following: an electrical connection (electronic device) with one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware, or a combination of these. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they can be implemented by any one or a combination of the following technologies well known in the art: a discrete logic circuit with logic gates for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In addition, the functional units in the embodiments of the present invention may be integrated in one processing module, may each exist physically on their own, or two or more of them may be integrated in one module. The integrated module can be implemented in the form of hardware or in the form of a software function module. If the integrated module is implemented in the form of a software function module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art can change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims (10)

1. An industry type recognition method for an object, characterized by comprising:
inputting the text information of objects to be identified into a language model for generating paragraph vectors for learning, to obtain the vector space of each object to be identified that is related to industry type;
choosing, according to the vector space of each object to be identified, first objects to be identified from all objects to be identified as training sample objects;
obtaining the labeled data of the training sample objects, wherein the labeled data indicates the industry type to which each training sample object belongs;
training a constructed industry type identification model using the vector space and the labeled data of the training sample objects, to obtain a target industry type identification model;
for each second object to be identified other than the training sample objects, inputting the vector space of the second object to be identified into the target industry type identification model for learning, to obtain the industry type to which the second object to be identified belongs.
2. The method according to claim 1, characterized in that, after obtaining the vector space of the object to be identified that is related to industry type, the method further comprises:
establishing a mapping relation between the vector space of the object to be identified and the identification information of the object to be identified;
storing, according to the mapping relation, the vector space of the object to be identified in a dictionary.
3. The method according to claim 1, characterized in that inputting the text information of the object to be identified into the language model for generating paragraph vectors, to obtain the vector space of the object to be identified that is related to industry type, comprises:
separately inputting the text information into language models built on different algorithms, to obtain the first vector space output by each language model;
combining the first vector spaces output by the different language models into the vector space of the object to be identified.
4. The method according to claim 3, characterized in that the first vector spaces output by the language models have the same dimension.
5. The method according to claim 1, characterized in that choosing, according to the vector space of each object to be identified, first objects to be identified from all objects to be identified as training sample objects comprises:
calculating the similarity between the objects to be identified according to their vector spaces, and clustering all objects to be identified according to the similarity;
randomly selecting first objects to be identified from each cluster as the training sample objects.
6. The method according to claim 2, characterized in that, before inputting the vector space of the second object to be identified into the target industry type identification model, the method further comprises:
obtaining the identification information of the second object to be identified;
querying the mapping relation according to the identification information of the second object to be identified, and obtaining the vector space of the second object to be identified from the dictionary.
7. An industry type identification device for an object, characterized by comprising:
a first input module, configured to input the text information of objects to be identified into a language model for generating paragraph vectors for learning, to obtain the vector space of each object to be identified that is related to industry type;
a selection module, configured to choose, according to the vector space of each object to be identified, first objects to be identified from all objects to be identified as training sample objects;
an acquisition module, configured to obtain the labeled data of the training sample objects, wherein the labeled data indicates the industry type to which each training sample object belongs;
a training module, configured to train a constructed industry type identification model using the vector space and the labeled data of the training sample objects, to obtain a target industry type identification model;
a second input module, configured to, for each second object to be identified other than the training sample objects, input the vector space of the second object to be identified into the target industry type identification model for learning, to obtain the industry type to which the second object to be identified belongs.
8. A computer device, characterized by comprising a processor and a memory;
wherein the processor reads the executable program code stored in the memory and runs the program corresponding to the executable program code, so as to implement the industry type recognition method of the object according to any one of claims 1-6.
9. A computer program product, characterized in that, when instructions in the computer program product are executed by a processor, the industry type recognition method of the object according to any one of claims 1-6 is implemented.
10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the industry type recognition method of the object according to any one of claims 1-6.
CN201810420223.1A 2018-05-04 2018-05-04 Industry type identification method and device of object Active CN108733778B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810420223.1A CN108733778B (en) 2018-05-04 2018-05-04 Industry type identification method and device of object

Publications (2)

Publication Number Publication Date
CN108733778A true CN108733778A (en) 2018-11-02
CN108733778B CN108733778B (en) 2022-05-17

Family

ID=63937073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810420223.1A Active CN108733778B (en) 2018-05-04 2018-05-04 Industry type identification method and device of object

Country Status (1)

Country Link
CN (1) CN108733778B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670267A (en) * 2018-12-29 2019-04-23 北京航天数据股份有限公司 A kind of data processing method and device
CN109960808A (en) * 2019-03-26 2019-07-02 广东工业大学 A kind of text recognition method, device, equipment and computer readable storage medium
CN110009364A (en) * 2019-01-08 2019-07-12 阿里巴巴集团控股有限公司 A kind of industry identification model determines method and apparatus
CN110188357A (en) * 2019-05-31 2019-08-30 阿里巴巴集团控股有限公司 The industry recognition methods of object and device
CN111444334A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111523315A (en) * 2019-01-16 2020-08-11 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN112115710A (en) * 2019-06-03 2020-12-22 腾讯科技(深圳)有限公司 Industry information identification method and device
CN112148959A (en) * 2019-06-27 2020-12-29 百度在线网络技术(北京)有限公司 Information recommendation method and device
CN112417150A (en) * 2020-11-16 2021-02-26 建信金融科技有限责任公司 Industry classification model training and using method, device, equipment and medium
CN112819106A (en) * 2021-04-16 2021-05-18 江西博微新技术有限公司 IFC component type identification method, device, storage medium and equipment
CN113377904A (en) * 2021-06-04 2021-09-10 百度在线网络技术(北京)有限公司 Industry action recognition method and device, electronic equipment and storage medium
CN113807749A (en) * 2021-11-19 2021-12-17 北京金堤科技有限公司 Object scoring method and device
CN117216688A (en) * 2023-11-07 2023-12-12 西南科技大学 Enterprise industry identification method and system based on hierarchical label tree and neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103150454A (en) * 2013-03-27 2013-06-12 山东大学 Dynamic machine learning modeling method based on sample recommending and labeling
CN104834940A (en) * 2015-05-12 2015-08-12 杭州电子科技大学 Medical image inspection disease classification method based on support vector machine (SVM)
US20160253596A1 (en) * 2015-02-26 2016-09-01 International Business Machines Corporation Geometry-directed active question selection for question answering systems
CN107193959A (en) * 2017-05-24 2017-09-22 南京大学 A kind of business entity's sorting technique towards plain text
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning

Cited By (21)

Publication number Priority date Publication date Assignee Title
CN109670267A (en) * 2018-12-29 2019-04-23 北京航天数据股份有限公司 A kind of data processing method and device
CN110009364A (en) * 2019-01-08 2019-07-12 阿里巴巴集团控股有限公司 A kind of industry identification model determines method and apparatus
CN111444334B (en) * 2019-01-16 2023-04-25 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111444334A (en) * 2019-01-16 2020-07-24 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111523315A (en) * 2019-01-16 2020-08-11 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN111523315B (en) * 2019-01-16 2023-04-18 阿里巴巴集团控股有限公司 Data processing method, text recognition device and computer equipment
CN109960808B (en) * 2019-03-26 2023-02-07 广东工业大学 Text recognition method, device and equipment and computer readable storage medium
CN109960808A (en) * 2019-03-26 2019-07-02 广东工业大学 A kind of text recognition method, device, equipment and computer readable storage medium
CN110188357A (en) * 2019-05-31 2019-08-30 阿里巴巴集团控股有限公司 The industry recognition methods of object and device
CN110188357B (en) * 2019-05-31 2023-06-20 创新先进技术有限公司 Industry identification method and device for objects
CN112115710A (en) * 2019-06-03 2020-12-22 腾讯科技(深圳)有限公司 Industry information identification method and device
CN112115710B (en) * 2019-06-03 2023-08-08 腾讯科技(深圳)有限公司 Industry information identification method and device
CN112148959A (en) * 2019-06-27 2020-12-29 百度在线网络技术(北京)有限公司 Information recommendation method and device
CN112417150A (en) * 2020-11-16 2021-02-26 建信金融科技有限责任公司 Industry classification model training and using method, device, equipment and medium
CN112819106B (en) * 2021-04-16 2021-07-13 江西博微新技术有限公司 IFC component type identification method, device, storage medium and equipment
CN112819106A (en) * 2021-04-16 2021-05-18 江西博微新技术有限公司 IFC component type identification method, device, storage medium and equipment
CN113377904A (en) * 2021-06-04 2021-09-10 百度在线网络技术(北京)有限公司 Industry action recognition method and device, electronic equipment and storage medium
CN113377904B (en) * 2021-06-04 2024-05-10 百度在线网络技术(北京)有限公司 Industry action recognition method and device, electronic equipment and storage medium
CN113807749A (en) * 2021-11-19 2021-12-17 北京金堤科技有限公司 Object scoring method and device
CN117216688A (en) * 2023-11-07 2023-12-12 西南科技大学 Enterprise industry identification method and system based on hierarchical label tree and neural network
CN117216688B (en) * 2023-11-07 2024-01-23 西南科技大学 Enterprise industry identification method and system based on hierarchical label tree and neural network

Also Published As

Publication number Publication date
CN108733778B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN108733778A (en) The industry type recognition methods of object and device
WO2022022163A1 (en) Text classification model training method, device, apparatus, and storage medium
US8682896B2 (en) Smart attribute classification (SAC) for online reviews
CN110245348A (en) A kind of intension recognizing method and system
CN108319720A (en) Man-machine interaction method, device based on artificial intelligence and computer equipment
Liang et al. AC-BLSTM: asymmetric convolutional bidirectional LSTM networks for text classification
CN108563655A (en) Text based event recognition method and device
JP2020149686A (en) Image processing method, device, server, and storage medium
CN108460098A (en) Information recommendation method, device and computer equipment
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
CN110008365B (en) Image processing method, device and equipment and readable storage medium
WO2022203899A1 (en) Document distinguishing based on page sequence learning
CN107992602A (en) Search result methods of exhibiting and device
CN113051914A (en) Enterprise hidden label extraction method and device based on multi-feature dynamic portrait
Kumar et al. BERT based semi-supervised hybrid approach for aspect and sentiment classification
CN113139664A (en) Cross-modal transfer learning method
Joshua Thomas et al. A deep learning framework on generation of image descriptions with bidirectional recurrent neural networks
CN109815500A (en) Management method, device, computer equipment and the storage medium of unstructured official document
WO2014073206A1 (en) Information-processing device and information-processing method
CN113849653A (en) Text classification method and device
US11321397B2 (en) Composition engine for analytical models
Tüselmann et al. Recognition-free question answering on handwritten document collections
CN112668633B (en) Adaptive graph migration learning method based on fine granularity field
CN105844207B (en) Line of text extracting method and line of text extract equipment
US20240028952A1 (en) Apparatus for attribute path generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant