CN113724055A - Commodity attribute mining method and device - Google Patents
- Publication number
- CN113724055A
- Authority
- CN
- China
- Prior art keywords
- commodity
- attribute
- sentence
- short sentence
- short
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/0601—Electronic shopping [e-shopping]
- G06Q30/0623—Item investigation
- G06Q30/0625—Directed, with specific intent or strategy
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Marketing (AREA)
- Artificial Intelligence (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- Probability & Statistics with Applications (AREA)
- Development Economics (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure provides a commodity attribute mining method and device. The method comprises the following steps: determining a mapping relation between a commodity attribute number and a short sentence describing a commodity attribute; inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence, wherein the vectorization model is obtained by training a pre-training network model with a set of short sentences, labeled with attribute data, corresponding to the target commodity class; and determining the commodity attribute features corresponding to the commodity attribute number based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors. The method provided by the disclosure can effectively analyze unstructured data, fully mine the hidden attribute features of commodities, and improve the efficiency of commodity attribute mining.
Description
Technical Field
The disclosure relates to the technical field of big data analysis, in particular to a commodity attribute mining method and device. It also relates to an electronic device and a processor-readable storage medium.
Background
For each item sold on a network platform, in addition to the structured information in the database describing the characteristics of the item, some unstructured information is available, the most important of which is the information in the pictures of the item detail page. Generally, each commodity attribute number corresponds to a plurality of commodity detail page pictures, each picture contains a plurality of advertisement sentences or explanatory sentences, and each sentence describes a feature of the commodity in a certain dimension. How to mine, based on this correspondence, the commodity attribute features expressed by each sentence, and from them infer the hidden attribute features of each commodity attribute number, in order to support commodity recommendation, commodity customization, selling point mining and the like, has become an urgent problem to be solved.
Disclosure of Invention
Therefore, the invention provides a commodity attribute mining method and device, which aim to overcome the defects of the manual mining schemes and computer-assisted mining schemes in the prior art: they are highly limited, time-consuming and labor-intensive, prone to generating redundant attributes, and require considerable manual intervention, so that the efficiency and stability of commodity attribute mining are poor.
The present disclosure provides a method for mining commodity attributes, including:
determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
Further, determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relationship, specifically including:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, the clustering the sentence vectors to obtain corresponding clustering clusters specifically includes: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
Further, the determining the mapping relationship between the commodity attribute number and the short sentence describing the commodity attribute specifically includes:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
Further, the product attribute mining method further includes:
based on a preset triplet loss function, gathering in the vector space the sentence vectors corresponding to similar short sentences, keeping the sentence vectors corresponding to non-similar short sentences far apart in the space, and adding corresponding pseudo labels, according to spatial distance, to sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
training the pre-training network model based on the training sample set.
Further, based on the short sentence set marked with the attribute data, the pre-training network model is trained in a transfer learning mode to obtain the vectorization model.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
The present disclosure also provides a commodity attribute mining device, including:
the mapping relation determining unit is used for determining the mapping relation between the commodity attribute number and the short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
the vectorization processing unit is used for inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantic meaning of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and the commodity attribute characteristic determining unit is used for determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, the clustering the sentence vectors to obtain corresponding clustering clusters specifically includes: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
Further, the mapping relationship determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
Further, the product attribute mining device further includes:
a sample set determining unit, configured to gather in the vector space, based on a preset triplet loss function, the sentence vectors corresponding to similar short sentences, to keep the sentence vectors corresponding to non-similar short sentences far apart in the space, and to add corresponding pseudo labels, according to spatial distance, to sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
Further, based on the short sentence set marked with the attribute data, the pre-training network model is trained in a transfer learning mode to obtain the vectorization model.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
The present disclosure also provides an electronic device, comprising: a memory, a processor and a computer program stored on the memory and operable on the processor, the processor implementing the steps of the method for mining the attributes of a good as described in any one of the above.
The present disclosure also provides a processor-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method for mining the attributes of goods as described in any one of the above.
The commodity attribute mining method provided by the disclosure obtains a sentence vector representing the semantics of a short sentence by determining the mapping relation between a commodity attribute number and the short sentence used for describing the commodity attribute in a commodity detail page picture and inputting the short sentence into a preset vectorization model; clustering the sentence vectors, and determining commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation; the vectorization model is obtained by training with a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model. The method can effectively analyze the unstructured data, fully excavate the attribute features hidden by the commodity, and improve the efficiency of excavating the commodity attributes.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for mining commodity attributes according to an embodiment of the present disclosure;
fig. 2 is a complete flow diagram of a method for mining commodity attributes according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a product attribute mining device according to an embodiment of the present disclosure;
fig. 4 is a schematic physical structure diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The following describes an embodiment of the product attribute mining method in detail based on the present disclosure. As shown in fig. 1, which is a schematic flow chart of a method for mining an attribute of a commodity according to an embodiment of the present disclosure, a specific implementation process includes the following steps:
Step 101: Determine the mapping relation between the commodity attribute number and the short sentences used for describing commodity attributes in the commodity detail page pictures.
Specifically, the commodity attribute number is a Stock Keeping Unit (SKU), that is, stock management information: a numeric or alphabetic code assigned to a commodity to uniquely identify it, so that an enterprise can manage stock more easily and efficiently. The code is typically 8 to 12 characters long and is located on the price label of the commodity. The commodity detail page picture is a picture on the online sales platform containing commodity detail information, and this detail information is unstructured. For each commodity, in addition to the structured information in the database describing its characteristics, some unstructured information is available, the most important of which is the information in the commodity detail page pictures. Each commodity attribute number corresponds to at least one commodity detail page picture; each picture contains a plurality of advertisement sentences or explanatory sentences; and each sentence either describes a commodity attribute feature in a certain dimension or is a meaningless sentence. The short sentences are these advertisement or explanatory sentences, and each commodity detail page picture corresponds to at least one short sentence describing a commodity attribute.
In this step, the product detail page picture is mapped to the product attribute number to obtain an initial mapping relationship between the product attribute number and the product detail page picture. Further, identifying the text information of the commodity detail page picture to obtain a short sentence corresponding to the commodity detail page picture and describing the commodity attribute, and determining the mapping relation among the commodity attribute number, the commodity detail page picture and the short sentence describing the commodity attribute according to the initial mapping relation and the short sentence corresponding to the commodity detail page picture and describing the commodity attribute.
It should be noted that, in the practical implementation of the present invention, the text information of the commodity detail page picture may be recognized by means including, but not limited to, OCR (Optical Character Recognition). Specifically, when optical character recognition is used, each commodity detail page picture can be mapped to its corresponding commodity attribute number through database operations, and the text information on the picture is read by the optical character recognition technology. Generally, one commodity attribute number corresponds to a plurality of commodity detail page pictures, and one commodity detail page picture corresponds to a plurality of OCR phrases, i.e. short sentences describing commodity attributes. After the text information on the picture is read, it can be cleaned: short sentences whose fonts do not meet preset conditions are filtered out, leaving short sentences that describe commodity attributes and can be input directly into the vectorization model. Finally, the mapping relation among the commodity attribute numbers, the commodity detail page pictures and the short sentences is obtained.
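The construction of the mapping relation described in step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `ocr_read` is a hypothetical stand-in for a real OCR engine, the picture identifiers are invented, and the cleaning rule filters by phrase length, whereas the disclosure filters by font conditions that it does not specify.

```python
from collections import defaultdict

def ocr_read(picture_id):
    """Hypothetical stand-in for an OCR engine: returns raw text lines of a picture."""
    fake_results = {
        "pic_001": ["Battery life of 60 hours", "  ", "Pink and blue"],
        "pic_002": ["High-speed sonic vibration", "##"],
    }
    return fake_results.get(picture_id, [])

def clean(lines, min_chars=4):
    """Drop blank or very short fragments (an assumed rule, not the font filter)."""
    return [line.strip() for line in lines if len(line.strip()) >= min_chars]

def build_mapping(sku_to_pictures):
    """Return {sku: {picture_id: [short sentences]}} from the initial SKU-to-picture mapping."""
    mapping = defaultdict(dict)
    for sku, pictures in sku_to_pictures.items():
        for picture_id in pictures:
            mapping[sku][picture_id] = clean(ocr_read(picture_id))
    return dict(mapping)

# Initial mapping relation: commodity attribute number -> detail page pictures.
initial_mapping = {"MS12446103": ["pic_001", "pic_002"]}
mapping = build_mapping(initial_mapping)
```

The nested dictionary keeps all three levels (commodity attribute number, picture, short sentence), so that a short sentence can later be traced back to its commodity attribute number.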
Step 102: Input the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence. The vectorization model is obtained by training a pre-training network model with a set of short sentences, labeled with attribute data, corresponding to the target commodity class.
In the embodiment of the invention, before this step is executed, a Bert model is fine-tuned in advance with manually labeled commodity attribute data, which yields a vectorization model capable of effectively distinguishing short sentence semantics. The sentence vectors output by the vectorization model are then clustered, and the clustering result gives a plurality of clusters of short sentences describing different commodity attributes, so that hidden commodity attribute information can be mined from new unlabeled data.
It should be noted that the Bert model is a pre-trained model. Assume that a training set A exists: the initial network model is pre-trained on training set A, and the network parameters learned on the task of training set A are stored for later use. When a new task B arrives, the same network structure can be adopted, the network parameters learned on training set A are loaded at initialization, and the network model is then trained with the training data of task B. If the loaded parameters are kept unchanged during this training, they are said to be "frozen"; if the loaded parameters continue to be adjusted as training on task B proceeds, the process is called "fine-tuning", that is, the network parameters are adjusted to better suit the current task B. In the embodiment of the present invention, the training set A may refer to a short sentence set of labeled attribute data corresponding to the target commodity category; the B training set corresponding to the task B may refer to an unlabeled sample statement set.
As shown in fig. 2, in a specific implementation process, the Bert model serves as an NLP embedding model: a short sentence is input and a corresponding sentence vector is output. The vector output at the first position, for the [CLS] (classification) token, can represent the category information of the whole short sentence, and short sentences with similar semantics lie closer together in the vector space. The invention can adopt the NLP-Bert-as-service model or the Bert model Chinese-BERT-wwm; these models have been trained on a Chinese character data set at the billion level and can therefore capture the semantic information in short sentences well.
In this step, after the short sentence is input into the preset vectorization model to obtain the sentence vector which represents the semantics of the short sentence, a preset triplet loss function can be used to gather in the vector space the sentence vectors corresponding to similar short sentences and to keep the sentence vectors corresponding to non-similar short sentences far apart. Corresponding pseudo labels are added, according to spatial distance, to sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model, and the pre-training network model is trained based on the training sample set.
It should be noted that, in the actual implementation process, because the input short sentences come from the text information of a specific commodity class, professional words and domain-specific expressions may appear, which may affect the effect of the Bert model. Therefore, in the embodiment of the present invention, the short sentence set that has been partially labeled manually with attribute information (i.e. the short sentence set labeled with the attribute data corresponding to the target commodity category) is used to perform transfer learning on top of the [CLS] (classification) token vector output by the Bert model, so that the final output vector better meets the requirements of the field to which the target commodity category belongs.
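The difference between keeping loaded parameters unchanged and fine-tuning them can be illustrated with a toy gradient step. This is only a sketch: the parameter names, scalar values, gradients and learning rate are invented for illustration, and a real Bert model has millions of tensor-valued parameters.

```python
def sgd_step(params, grads, lr, frozen=()):
    """Return updated parameters; names listed in `frozen` keep their loaded values."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }

# Toy scalar "parameters" loaded from pre-training on task A (hypothetical names).
pretrained = {"encoder.w": 0.5, "head.w": 0.1}
# Toy gradients computed on the training data of task B.
grads = {"encoder.w": 0.2, "head.w": -0.4}

# Loaded encoder parameter kept unchanged; only the head is trained.
frozen_run = sgd_step(pretrained, grads, lr=0.1, frozen={"encoder.w"})
# Fine-tuning: every loaded parameter continues to be adjusted on task B.
finetune_run = sgd_step(pretrained, grads, lr=0.1)
```

In the first run the encoder parameter stays at its loaded value, while in the second run it moves along with the rest of the network.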
Because the vectorization model outputs a sentence vector representing the semantics of a short sentence, and category labels exist for the partially manually labeled short sentence set, a triplet loss function (triplet loss) can be selected as the loss function in the specific implementation. The system is expected not only to recover the manually labeled categories of commodity attribute information, but also to learn automatically new categories that were not manually labeled. The principle of the triplet loss is as follows: each time, a triplet (a, b, c) is selected, wherein a and b belong to the same category and a and c belong to different categories; the loss function is the distance between a and b, minus the distance between a and c, plus a margin value. Through continuous back propagation, the distance between a and b is driven to be smaller than the distance between a and c by at least the margin. Finally, the vectors corresponding to similar short sentences are gathered together in the space, and the vectors corresponding to non-similar short sentences are far apart in the space. The manually labeled data is usually a relatively small amount, so for each unlabeled sample the nearest sample and the farthest sample are computed from the embeddings and used as pseudo labels for triplet loss training. Specifically, the training set of the fine-tuned Bert model is gradually enlarged by taking the nearest samples as positives and the farthest samples as negatives. In addition, in the transfer learning process, a transfer learning layer consisting of a two-layer fully-connected network achieves a good effect; it takes the [CLS] (classification) token vector output by the Bert model as input and outputs the final embedding vector, namely the sentence vector.
Step 103: Determine the commodity attribute features corresponding to the commodity attribute numbers based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors.
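The loss and the pseudo-label selection described above can be sketched in a few lines. The sketch makes several assumptions that the disclosure does not state: Euclidean distance is used, the loss is clamped at zero as in the commonly used formulation of triplet loss, and the two-dimensional vectors are invented toy data rather than real sentence vectors.

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """d(a, b) - d(a, c) + margin, clamped at zero (the usual formulation)."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

def pseudo_triplet(anchor, pool):
    """For one unlabeled vector, take the nearest vector in the pool as the
    pseudo-positive and the farthest vector as the pseudo-negative."""
    ranked = sorted(pool, key=lambda v: distance(anchor, v))
    return anchor, ranked[0], ranked[-1]

unlabeled = (0.0, 0.0)                       # toy embedding of an unlabeled sentence
pool = [(0.0, 1.0), (3.0, 4.0), (1.0, 1.0)]  # toy embeddings of other sentences
a, p, n = pseudo_triplet(unlabeled, pool)
loss = triplet_loss(a, p, n, margin=1.0)
```

Here the pseudo-positive is already closer to the anchor than the pseudo-negative by more than the margin, so the loss is zero; a triplet that violates the margin yields a positive loss and therefore a gradient.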
In the embodiment of the present invention, the commodity attribute features corresponding to the commodity attribute number are determined based on the mapping relation and the commodity public attribute features corresponding to the clusters of the sentence vectors. The specific implementation process includes: clustering the sentence vectors to obtain corresponding clusters; extracting the commodity attribute features common to the short sentences in each cluster, and taking these common features as the commodity public attribute features corresponding to the cluster; and matching the commodity public attribute features corresponding to the clusters with the mapping relation to obtain the commodity attribute features of the commodity attribute numbers. After a short sentence is vectorized, a correspondence exists between the short sentence and its sentence vector, so the correspondence between the clusters formed by the sentence vectors and the short sentences can be obtained, and the corresponding commodity attribute features can then be determined by matching this correspondence with the mapping relation. The commodity public attribute feature refers to the commodity attribute feature shared by a plurality of short sentences in the cluster.
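The matching step can be sketched as a join between three lookups. The sketch assumes, for simplicity, that each short sentence belongs to exactly one commodity attribute number and that the cluster assignments and cluster feature names are already available; the function name, cluster indices and phrase wordings are illustrative.

```python
def attributes_per_sku(phrase_to_sku, phrase_to_cluster, cluster_to_feature):
    """Join short sentence -> cluster -> public attribute feature back onto each SKU."""
    result = {}
    for phrase, sku in phrase_to_sku.items():
        feature = cluster_to_feature[phrase_to_cluster[phrase]]
        result.setdefault(sku, set()).add(feature)
    return result

# Mapping relation flattened to short sentence -> commodity attribute number.
phrase_to_sku = {
    "battery life of 60 hours": "MS12446103",
    "battery life improved by 50%": "MS12446103",
    "high-efficiency sound wave vibration": "MS12446103",
}
# Cluster index of each short sentence's sentence vector (assumed clustering output).
phrase_to_cluster = {
    "battery life of 60 hours": 0,
    "battery life improved by 50%": 0,
    "high-efficiency sound wave vibration": 1,
}
# Commodity public attribute feature extracted for each cluster.
cluster_to_feature = {0: "ultra-long endurance", 1: "high-frequency sound waves"}

sku_features = attributes_per_sku(phrase_to_sku, phrase_to_cluster, cluster_to_feature)
```

Using a set per commodity attribute number means that several short sentences falling into the same cluster contribute the feature only once.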
The sentence vectors are clustered to obtain corresponding clustering clusters, and the specific implementation process comprises the following steps: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector to obtain a cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
It should be noted that, in the specific implementation process of the present invention, the sentence vectors are clustered by using a mode including, but not limited to, a K-means Clustering model, i.e., a K-means Clustering Algorithm (Kmeans Clustering Algorithm), which is not limited herein.
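As an illustration of the clustering step, the following is a minimal K-means sketch in plain Python. It is not the disclosed implementation: the centroids are initialised deterministically from the first k vectors (an assumption made so that the result is reproducible), the iteration count is fixed, and the two-dimensional vectors stand in for real sentence vectors.

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm with centroids initialised to the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Toy "sentence vectors": two obvious groups.
sentence_vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels, centroids = kmeans(sentence_vectors, k=2)
```

The cluster number k is the hyper-parameter that must be chosen in advance; in practice a library implementation with better initialisation would normally be preferred over this sketch.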
As shown in fig. 2, the emotion result obtained by migration learning based on the Bert model is clustered based on a preset clustering algorithm to obtain a plurality of clusters, and the short sentence semantics in each cluster are similar. Due to the transfer learning in step S102, the vectorization model not only can recognize the general semantics of chinese, but also can be more sensitive to the words of specific commodities. Therefore, the short sentences can be clustered into corresponding cluster clusters by selecting the preset hyper-parameter, namely the cluster number n. Finally, common commodity attribute characteristics can be extracted by analyzing the short sentences in each cluster and are used as the common attribute characteristics of the commodities corresponding to the short sentences in the cluster; and determining the commodity attribute characteristics corresponding to the commodity attribute numbers (namely sku1 and sku2) based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
In a specific implementation process, the commodity attribute mining method provided by the invention can be applied to attribute mining tasks for commodities such as electric toothbrushes and electric shavers; the type of commodity to which it is applied is not specifically limited. Take an electric toothbrush whose commodity attribute number is MS12446103 as an example. The commodity detail pages of the electric toothbrush are collected, and the short sentences describing the attributes of the electric toothbrush in the commodity detail page pictures are identified based on optical character recognition. These short sentences include, but are not limited to: an endurance time of 60 hours; endurance time improved by 50%; pink and blue colors; a vibration frequency of up to 31,000 times per minute through sound wave technology; high-speed sound wave vibration of about 40,000 times per minute; and high-efficiency sound wave vibration. After the mapping relationship between the commodity attribute number and the short sentences is determined, the short sentences are input into a vectorization model obtained in advance by fine-tuning on labeled electric toothbrush attribute data, and the corresponding sentence vectors are output. The sentence vectors are then clustered. For example, the sentence vectors of the short sentences that express the vibration frequency characteristic of the electric toothbrush (a vibration frequency of up to 31,000 times per minute through sound wave technology, high-speed sound wave vibration of about 40,000 times per minute, high-efficiency sound wave vibration, and the like) are gathered into one cluster; the commodity public attribute feature extracted for this cluster is high-frequency sound waves, and according to the constructed mapping relationship and the commodity attribute number MS12446103, the commodity attribute feature corresponding to that number is determined to be high-frequency sound waves. Likewise, the sentence vectors of the short sentences that express the endurance time characteristic (an endurance time of 60 hours, endurance time improved by 50%, and the like) are gathered into another cluster; the commodity public attribute feature extracted for this cluster is ultra-long endurance, and the commodity attribute feature corresponding to MS12446103 is determined to be ultra-long endurance. In this example, the number of clusters n is 2.
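The final matching step of the electric toothbrush example can be sketched as a pair of dictionary lookups. The phrase strings are English paraphrases of the example, and the cluster assignments and cluster labels are written out by hand here; in the described method they would come from the vectorization model, the clustering step, and the analysis of each cluster. The function name `attributes_for_sku` is hypothetical.

```python
# Mapping relation: commodity attribute number -> short sentences (paraphrased).
sku_to_phrases = {
    "MS12446103": [
        "endurance time of 60 hours",
        "endurance time improved by 50%",
        "vibration frequency of up to 31,000 times per minute",
        "high-speed sound wave vibration of about 40,000 times per minute",
    ],
}

# Assumed clustering output (n = 2): short sentence -> cluster index.
phrase_to_cluster = {
    "endurance time of 60 hours": 0,
    "endurance time improved by 50%": 0,
    "vibration frequency of up to 31,000 times per minute": 1,
    "high-speed sound wave vibration of about 40,000 times per minute": 1,
}

# Commodity public attribute feature extracted for each cluster.
cluster_to_attribute = {
    0: "ultra-long endurance",
    1: "high-frequency sound waves",
}


def attributes_for_sku(sku, sku_to_phrases, phrase_to_cluster, cluster_to_attribute):
    """Collect, in first-seen order, the public attribute features of the
    clusters that contain the short sentences mapped to this number."""
    features = []
    for phrase in sku_to_phrases.get(sku, []):
        cluster = phrase_to_cluster.get(phrase)
        if cluster is None:
            continue  # the short sentence was not clustered
        feature = cluster_to_attribute[cluster]
        if feature not in features:
            features.append(feature)
    return features


toothbrush_features = attributes_for_sku(
    "MS12446103", sku_to_phrases, phrase_to_cluster, cluster_to_attribute
)
```

Under these hand-written inputs, `toothbrush_features` holds the two features named in the example, ultra-long endurance and high-frequency sound waves.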
With the commodity attribute mining method of the embodiment of the present disclosure, the mapping relationship between a commodity attribute number and the short sentences describing commodity attributes in a commodity detail page picture is determined, and the short sentences are input into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences. The sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute number are determined based on the mapping relationship and the commodity public attribute features corresponding to the clusters of sentence vectors. The vectorization model is obtained by transfer learning training on a short sentence set of labeled attribute data corresponding to the target commodity class, on the basis of a pre-training network model. Because the vectorization model is established by transfer learning, the labeled short sentence information can be fully utilized, so that the vectorization model is better suited to the current commodity class; the unstructured data is analyzed effectively, the hidden attribute features of the commodity are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method, the present disclosure also provides a commodity attribute mining device. Since the device embodiment is similar to the method embodiment above, it is described relatively briefly; for the relevant points, reference may be made to the description of the method embodiment, and the embodiment of the commodity attribute mining device described below is only illustrative. Fig. 3 is a schematic structural diagram of a commodity attribute mining device according to an embodiment of the present disclosure.
The commodity attribute mining device specifically comprises the following parts:
a mapping relation determining unit 301, configured to determine a mapping relation between the product attribute number and a short sentence in the product detail page picture for describing the product attribute.
Specifically, the product attribute number, i.e., the inventory management information, refers to a numeric code or an alphabetic code assigned to a product for uniquely identifying the product attribute, so that an enterprise can manage inventory more easily and efficiently. The commodity detail page picture is a picture containing commodity detail information on an online sales counter, and the commodity detail information is unstructured information. For each commodity, in addition to the structured information in the database that describes its characteristics, some unstructured information is available, the most important of which is the information in the commodity detail page pictures. Each commodity attribute number corresponds to at least one commodity detail page picture, each commodity detail page picture contains a plurality of advertising sentences or explanatory sentences, and each sentence describes the commodity attribute feature of one dimension of the commodity. The short sentences are these advertising or explanatory sentences, and each commodity detail page picture corresponds to at least one short sentence describing a commodity attribute.
In a specific implementation process, the mapping relationship determining unit 301 first needs to map the product detail page picture to the product attribute number to obtain an initial mapping relationship between the product attribute number and the product detail page picture. Further, the mapping relationship determining unit 301 identifies the text information of the product detail page picture, obtains a short sentence corresponding to the product detail page picture and describing the product attribute, and determines the mapping relationship among the product attribute number, the product detail page picture and the short sentence describing the product attribute according to the initial mapping relationship and the short sentence corresponding to the product detail page picture and describing the product attribute.
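The two-stage mapping described above (commodity attribute number to picture, then picture to short sentences) can be sketched as follows. The OCR step is represented by a caller-supplied function; `fake_ocr` and its canned output are stand-ins invented for this sketch, since the source does not name an OCR library, and the file name and phrases are illustrative.

```python
def build_mapping(sku_to_pictures, recognize_text):
    """Return {commodity attribute number: [(picture, short sentence), ...]}.

    `sku_to_pictures` is the initial mapping relation, and `recognize_text`
    is any callable that returns the short sentences found in a picture.
    """
    mapping = {}
    for sku, pictures in sku_to_pictures.items():
        pairs = []
        for picture in pictures:
            for phrase in recognize_text(picture):
                pairs.append((picture, phrase))
        mapping[sku] = pairs
    return mapping


# Stand-in for an OCR engine: canned results keyed by a hypothetical file name.
_canned_ocr_output = {
    "detail_1.png": ["endurance time of 60 hours", "pink and blue colors"],
}


def fake_ocr(picture):
    """Return the canned short sentences for a picture, or an empty list."""
    return _canned_ocr_output.get(picture, [])


mapping = build_mapping({"sku1": ["detail_1.png"]}, fake_ocr)
```

Keeping the picture alongside each short sentence preserves the three-way relation among number, picture, and short sentence that the unit is said to determine.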
A vectorization processing unit 302, configured to input the short sentence into a preset vectorization model, and obtain a sentence vector that represents the semantic meaning of the short sentence and is output by the vectorization model.
The vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and a commodity attribute feature determining unit 303, configured to determine, based on the commodity public attribute feature corresponding to the cluster of the sentence vector and the mapping relationship, a commodity attribute feature corresponding to the commodity attribute number.
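One way to picture how units 301, 302, and 303 cooperate is a small container that holds one callable per unit and runs them in order. This is only a structural sketch: the class name, field names, and type signatures are assumptions, and each callable stands in for logic that the source describes only in prose.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Mapping = Dict[str, List[str]]  # commodity attribute number -> short sentences


@dataclass
class CommodityAttributeMiner:
    # Unit 301: determines the mapping relation.
    determine_mapping: Callable[[], Mapping]
    # Unit 302: turns short sentences into sentence vectors.
    vectorize: Callable[[Sequence[str]], List[Vector]]
    # Unit 303: clusters the vectors and assigns attribute features.
    determine_features: Callable[[Mapping, List[str], List[Vector]], Dict[str, List[str]]]

    def run(self) -> Dict[str, List[str]]:
        mapping = self.determine_mapping()
        # Deduplicate the short sentences and fix their order so that
        # phrases[i] always corresponds to vectors[i].
        phrases = sorted({p for ps in mapping.values() for p in ps})
        vectors = self.vectorize(phrases)
        return self.determine_features(mapping, phrases, vectors)
```

Passing the units in as callables keeps the sketch independent of any particular OCR engine, language model, or clustering library.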
Further, the commodity attribute feature determining unit is specifically configured to:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
Further, the clustering the sentence vectors to obtain corresponding clustering clusters specifically includes: based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
Further, the mapping relationship determining unit is specifically configured to:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation according to the initial mapping relation and a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
Further, the product attribute mining device further includes:
a sample set determining unit, configured to, based on a preset triplet loss function, draw together in the vector space the sentence vectors corresponding to similar short sentences and push apart in the vector space the sentence vectors corresponding to dissimilar short sentences, and to add corresponding pseudo labels, according to spatial distance, to the sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set for the pre-training network model;
and the model training unit is used for training the pre-training network model based on the training sample set.
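The source does not give the form of the loss or the pseudo-labeling rule. The sketch below assumes the commonly used triplet loss, max(d(a, p) - d(a, n) + margin, 0) with Euclidean distance, and assumes that an unlabeled sentence vector receives the label of its nearest labeled vector. The margin value and function names are illustrative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero once the anchor is closer to the similar (positive) vector than
    to the dissimilar (negative) vector by at least `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


def pseudo_label(vector, labelled):
    """Assumed rule: copy the label of the nearest labelled vector.

    `labelled` is a list of (vector, label) pairs.
    """
    _, nearest_label = min(labelled, key=lambda item: euclidean(vector, item[0]))
    return nearest_label
```

Minimising this loss over many triplets pulls the vectors of similar short sentences together and pushes dissimilar ones apart, which is what makes a distance-based pseudo label meaningful.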
Further, the commodity attribute number corresponds to at least one commodity detail page picture, and the commodity detail page picture corresponds to at least one short sentence describing the commodity attribute.
Further, the product attribute number is stock management information of the product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
With the commodity attribute mining device of the embodiment of the present disclosure, the mapping relationship between a commodity attribute number and the short sentences describing commodity attributes in a commodity detail page picture is determined, and the short sentences are input into a preset vectorization model to obtain the sentence vectors, output by the vectorization model, that represent the semantics of the short sentences. The sentence vectors are clustered, and the commodity attribute features corresponding to the commodity attribute number are determined based on the mapping relationship and the commodity public attribute features corresponding to the clusters of sentence vectors. The vectorization model is obtained by transfer learning training on a short sentence set of labeled attribute data corresponding to the target commodity class, on the basis of a pre-training network model. Because the vectorization model is established by transfer learning, the labeled short sentence information can be fully utilized, so that the vectorization model is better suited to the current commodity class; the unstructured data is analyzed effectively, the hidden attribute features of the commodity are fully mined, and the efficiency of commodity attribute mining is improved.
Corresponding to the commodity attribute mining method, the present disclosure further provides an electronic device. Since the embodiment of the electronic device is similar to the method embodiment above, it is described briefly; for the relevant points, refer to the description of the method embodiment, and the electronic device described below is only illustrative. Fig. 4 is a schematic physical structure diagram of an electronic device disclosed in the embodiment of the present disclosure. The electronic device may include: a processor 401, a memory 402 and a communication bus 403, wherein the processor 401 and the memory 402 communicate with each other through the communication bus 403 and communicate with the outside through a communication interface 404. The processor 401 may invoke logic instructions in the memory 402 to perform the commodity attribute mining method, which comprises: determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute, wherein the short sentence is in the commodity detail page picture; inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence, wherein the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
Furthermore, the logic instructions in the memory 402 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a Memory chip, a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present disclosure also provides a computer program product, which includes a computer program stored on a processor-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the commodity attribute mining method provided by the above method embodiments. The method comprises the following steps: determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute, wherein the short sentence is in the commodity detail page picture; inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence, wherein the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
In another aspect, the disclosed embodiments also provide a processor-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the commodity attribute mining method provided by each of the above embodiments. The method comprises the following steps: determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute, wherein the short sentence is in the commodity detail page picture; inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence, wherein the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model; and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
The processor-readable storage medium can be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present disclosure, not to limit them; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present disclosure.
Claims (10)
1. A commodity attribute mining method is characterized by comprising the following steps:
determining a mapping relation between the commodity attribute number and a short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantics of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the cluster of the sentence vectors and the mapping relation.
2. The product attribute mining method according to claim 1, wherein determining the product attribute features corresponding to the product attribute numbers based on the product public attribute features corresponding to the cluster of the sentence vectors and the mapping relationship specifically comprises:
clustering the sentence vectors to obtain corresponding clustering clusters;
extracting commodity attribute features of the short sentence commonalities in the clustering cluster, and taking the commodity attribute features of the commonalities as the commodity public attribute features corresponding to the clustering cluster;
and matching the commodity public attribute characteristics corresponding to the cluster with the mapping relation to obtain the commodity attribute characteristics of the commodity attribute numbers.
3. The product attribute mining method according to claim 2, wherein the clustering the sentence vectors to obtain corresponding cluster specifically comprises:
based on the semantics of the sentence corresponding to the sentence vector, clustering the sentence vector by using a K-means clustering model to obtain a clustering cluster containing a plurality of sentence vectors; and the semantics of the short sentences corresponding to the sentence vectors in the clustering cluster meet a preset semantic approximate condition.
4. The product attribute mining method according to claim 1, wherein the determining of the mapping relationship between the product attribute number and the short sentence describing the product attribute specifically comprises:
mapping the commodity detail page picture to the commodity attribute number to obtain an initial mapping relation between the commodity attribute number and the commodity detail page picture;
recognizing the text information of the commodity detail page picture by using an optical character recognition mode to obtain a short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute;
and determining the mapping relation between the commodity attribute number and the short sentence according to the initial mapping relation and the short sentence which is corresponding to the commodity detail page picture and describes the commodity attribute.
5. The product attribute mining method of claim 1, further comprising: based on a preset triplet loss function, drawing together in the vector space the sentence vectors corresponding to similar short sentences, pushing apart in the vector space the sentence vectors corresponding to dissimilar short sentences, and adding corresponding pseudo labels, according to spatial distance, to the sample sentences among the short sentences that are not labeled with attribute data, so as to determine a training sample set of the pre-training network model;
training the pre-training network model based on the training sample set.
6. The product attribute mining method of claim 1, further comprising: and training the pre-training network model by using a transfer learning mode based on the short sentence set marked with the attribute data to obtain the vectorization model.
7. The product attribute mining method according to any one of claims 1 to 6, wherein the product attribute number is stock management information of a product, and the stock management information refers to a numeric code or an alphabetic code for uniquely identifying the product.
8. A commodity attribute mining device, characterized by comprising:
the mapping relation determining unit is used for determining the mapping relation between the commodity attribute number and the short sentence describing the commodity attribute; wherein the short sentence is in the commodity detail page picture;
the vectorization processing unit is used for inputting the short sentence into a preset vectorization model to obtain a sentence vector which is output by the vectorization model and represents the semantic meaning of the short sentence;
the vectorization model is obtained by training a short sentence set of labeled attribute data corresponding to the target commodity class on the basis of a pre-training network model;
and the commodity attribute characteristic determining unit is used for determining the commodity attribute characteristics corresponding to the commodity attribute numbers based on the commodity public attribute characteristics corresponding to the clustering clusters of the sentence vectors and the mapping relation.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the commodity attribute mining method according to any one of claims 1 to 7.
10. A processor-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the commodity attribute mining method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111076600.2A CN113724055B (en) | 2021-09-14 | 2021-09-14 | Commodity attribute mining method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113724055A (en) | 2021-11-30 |
CN113724055B (en) | 2024-04-09 |
Family
ID=78683691
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111076600.2A Active CN113724055B (en) | 2021-09-14 | 2021-09-14 | Commodity attribute mining method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113724055B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114169966A (en) * | 2021-12-08 | 2022-03-11 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2011227886A (en) * | 2010-03-30 | 2011-11-10 | Rakuten Inc | Commodity information providing system, commodity information providing method, and program |
CN106067132A (en) * | 2016-05-27 | 2016-11-02 | 乐视控股(北京)有限公司 | The method to set up of item property and device |
CN106408321A (en) * | 2015-07-31 | 2017-02-15 | 华为技术有限公司 | Management method and device of commodity template, and method and device for calling database, and system |
CN107203548A (en) * | 2016-03-17 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Attribute acquisition methods and device |
CN107679247A (en) * | 2017-10-31 | 2018-02-09 | 南威软件股份有限公司 | A kind of method that electric business website realizes self-defined maintenance items extension information |
CN109670066A (en) * | 2018-12-11 | 2019-04-23 | 江西师范大学 | A kind of Freehandhand-drawing formula toggery image search method based on dual path Deep Semantics network |
CN110347835A (en) * | 2019-07-11 | 2019-10-18 | 招商局金融科技有限公司 | Text Clustering Method, electronic device and storage medium |
CN110490682A (en) * | 2018-05-15 | 2019-11-22 | 北京京东尚科信息技术有限公司 | The method and apparatus for analyzing item property |
CN111401409A (en) * | 2020-02-28 | 2020-07-10 | 创新奇智(青岛)科技有限公司 | Commodity brand feature acquisition method, sales volume prediction method, device and electronic equipment |
CN113065882A (en) * | 2020-01-02 | 2021-07-02 | 阿里巴巴集团控股有限公司 | Commodity processing method and device and electronic equipment |
2021-09-14: CN application CN202111076600.2A (patent CN113724055B), status: Active
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114169966A (en) * | 2021-12-08 | 2022-03-11 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
CN114169966B (en) * | 2021-12-08 | 2022-08-05 | 海南港航控股有限公司 | Method and system for extracting unit data of goods by tensor |
Also Published As
Publication number | Publication date |
---|---|
CN113724055B (en) | 2024-04-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108399228B (en) | Article classification method and device, computer equipment and storage medium | |
US8682896B2 (en) | Smart attribute classification (SAC) for online reviews | |
CN112487149B (en) | Text auditing method, model, equipment and storage medium | |
CN111339260A (en) | BERT and QA thought-based fine-grained emotion analysis method | |
CN111783993A (en) | Intelligent labeling method and device, intelligent platform and storage medium | |
CN112364628B (en) | New word recognition method and device, electronic equipment and storage medium | |
CN111782793A (en) | Intelligent customer service processing method, system and equipment | |
CN111462752A (en) | Client intention identification method based on attention mechanism, feature embedding and BI-LSTM | |
CN113010678A (en) | Training method of classification model, text classification method and device | |
CN115600109A (en) | Sample set optimization method and device, equipment, medium and product thereof | |
JP7291419B2 (en) | Method and apparatus for providing information about machine learning-based similar items | |
CN115481355A (en) | Data modeling method based on category expansion | |
CN111651597A (en) | Multi-source heterogeneous commodity information classification method based on Doc2Vec and convolutional neural network | |
CN109409529B (en) | Event cognitive analysis method, system and storage medium | |
CN113869609A (en) | Method and system for predicting confidence of frequent subgraph of root cause analysis | |
CN113724055B (en) | Commodity attribute mining method and device | |
CN117034948B (en) | Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion | |
CN117252186A (en) | XAI-based information processing method, device, equipment and storage medium | |
CN114970553B (en) | Information analysis method and device based on large-scale unmarked corpus and electronic equipment | |
CN115168567B (en) | Knowledge graph-based object recommendation method | |
CN111126038A (en) | Information acquisition model generation method and device and information acquisition method and device | |
CN116127013A (en) | Personal sensitive information knowledge graph query method and device | |
CN115953217A (en) | Commodity grading recommendation method and device, equipment, medium and product thereof | |
CN115660695A (en) | Customer service personnel label portrait construction method and device, electronic equipment and storage medium | |
CN112948561B (en) | Method and device for automatically expanding question-answer knowledge base |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||